# Detox: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation

Shashank Rajput\* (rajput3@wisc.edu), Hongyi Wang\* (hongyiwang@cs.wisc.edu), Zachary Charles (zcharles@wisc.edu), Dimitris Papailiopoulos (dimitris@papail.io)

\* Authors contributed equally to this paper and are listed alphabetically.

###### Abstract

To improve the resilience of distributed training to worst-case, or Byzantine, node failures, several recent approaches have replaced gradient averaging with robust aggregation methods. Such techniques can have high computational costs, often quadratic in the number of compute nodes, and only have limited robustness guarantees. Other methods have instead used redundancy to guarantee robustness, but can only tolerate a limited number of Byzantine failures. In this work, we present Detox, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation. Detox operates in two steps: a filtering step that uses limited redundancy to significantly reduce the effect of Byzantine nodes, and a hierarchical aggregation step that can be used in tandem with any state-of-the-art robust aggregation method. We show theoretically that this leads to a substantial increase in robustness, and has a per-iteration runtime that can be nearly linear in the number of compute nodes. We provide extensive experiments over real distributed setups across a variety of large-scale machine learning tasks, showing that Detox leads to orders of magnitude accuracy and speedup improvements over many state-of-the-art Byzantine-resilient approaches.

All authors are affiliated with the University of Wisconsin-Madison.

## 1 Introduction

To scale the training of machine learning models, gradient computations can often be distributed across multiple compute nodes. After computing these local gradients, a parameter server then averages them, and updates a global model. As the scale of data and available compute power grows, so does the probability that some compute nodes output unreliable gradients. This can be due to power outages, faulty hardware, or communication failures, or due to security issues, such as the presence of an adversary governing the output of a compute node.

Due to the difficulty in quantifying these different types of errors separately, we often model them as Byzantine failures. Such failures are assumed to be able to result in any output, adversarial or otherwise. Unfortunately, the presence of a single Byzantine compute node can result in arbitrarily bad global models when aggregating gradients via their average [5].

In the context of distributed training, there have generally been two distinct approaches to improve Byzantine robustness. The first replaces the gradient averaging step at the parameter server with a robust aggregation step, such as the geometric median and variants thereof [5, 8, 23, 10, 27, 26]. The second approach instead assigns each node redundant gradients, and uses this redundancy to eliminate the effect of Byzantine failures [7, 12, 28].

Both of the above approaches have their own limitations. For the first, robust aggregators are typically expensive to compute and scale super-linearly (in many cases quadratically [19, 10]) with the number of compute nodes. Moreover, such methods often come with limited theoretical guarantees of Byzantine robustness (e.g., only establishing convergence in the limit, or only guaranteeing that the output of the aggregator has positive inner product with the true gradient [5, 19]) and often require strong assumptions, such as bounds on the dimension of the model being trained. On the other hand, redundancy or coding-theoretic based approaches offer strong guarantees of perfect recovery for the aggregated gradients. However, such approaches, in the worst case, require each node to compute $\Omega(q)$ times more gradients, where $q$ is the number of Byzantine machines [7]. This overhead is prohibitive in settings with a large number of Byzantine machines.

Our contributions. In this work, we present Detox, a Byzantine-resilient distributed training framework that first uses computational redundancy to filter out almost all Byzantine gradients, and then performs a hierarchical robust aggregation method. Detox is scalable, flexible, and is designed to be used on top of any robust aggregation method to obtain improved robustness and efficiency. A high-level description of the hierarchical nature of Detox is given in Fig. 2.

Detox proceeds in three steps. First, the parameter server (PS) arranges the compute nodes in groups of $r$ that compute the same gradients. While this step requires redundant computation at the node level, it will eventually allow for much faster computation at the PS level, as well as improved robustness. After all compute nodes send their gradients to the PS, the PS takes the majority vote of each group of gradients. We show that by setting $r$ to be logarithmic in the number of compute nodes, after the majority vote step only a constant number of Byzantine gradients are still present, even if the number of Byzantine nodes is a constant fraction of the total number of compute nodes. Detox then performs hierarchical robust aggregation in two steps: First, it partitions the filtered gradients into a small number of groups, and aggregates them using simple techniques such as averaging. Second, it applies any robust aggregator (e.g., geometric median [8, 26], Bulyan [19], Multi-Krum [10], etc.) to the averaged gradients to further minimize the effect of any remaining traces of the original Byzantine gradients.

We prove that Detox can obtain orders of magnitude improved robustness guarantees compared to its competitors, and can achieve this at a nearly linear complexity in the number of compute nodes $p$, unlike methods like Bulyan [19] that require run-time that is quadratic in $p$. We extensively test our method in real distributed setups and large-scale settings, showing that by combining Detox with previously proposed Byzantine-robust methods, such as Multi-Krum, Bulyan, and coordinate-wise median, we increase the robustness and reduce the overall runtime of the algorithm. Moreover, we show that under strong Byzantine attacks, Detox can lead to almost a 40% increase in accuracy over vanilla implementations of Byzantine-robust aggregation. A brief performance comparison with some of the current state-of-the-art aggregators is shown in Fig. 2.

#### Related work.

The topic of Byzantine fault tolerance has been extensively studied since the early 80s by Lamport et al. [16], and deals with worst-case, and/or adversarial failures, e.g., system crashes, power outages, software bugs, and adversarial agents that exploit security flaws. In the context of distributed optimization, these failures are manifested through a subset of compute nodes returning to the master flawed or adversarial updates. It is now well understood that first-order methods, such as gradient descent or mini-batch SGD, are not robust to Byzantine errors; even a single erroneous update can introduce arbitrary errors to the optimization variables.

Byzantine-tolerant ML has been extensively studied in recent years [13, 24, 25, 14, 4, 8], establishing that while average-based gradient methods are susceptible to adversarial nodes, median-based update methods can in some cases achieve better convergence, while being robust to some attacks. Although theoretical guarantees are provided in many works, the proposed algorithms in many cases only ensure a weak form of resilience against Byzantine failures, and often fail against strong Byzantine attacks [19]. A stronger form of Byzantine resilience is desirable for most distributed machine learning applications. To the best of our knowledge, Draco [7] and Bulyan [19] are the only proposed methods that guarantee strong Byzantine resilience. However, as mentioned above, Draco requires heavy redundant computation from the compute nodes, while Bulyan requires heavy computation overhead on the parameter server end.

We note that [1] presents an alternative approach that does not fit easily under either category, but requires convexity of the underlying loss function. Finally, [3] examines the robustness of signSGD with a majority vote aggregation, but studies a restricted Byzantine failure setup that only allows for a blind multiplicative adversary.

## 2 Problem Setup

Our goal is to solve the following empirical risk minimization problem:

$$\min_{w} F(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w),$$

where $w \in \mathbb{R}^d$ denotes the parameters of a model, and $f_i$ is the loss function on the $i$-th training sample. To approximately solve this problem, we often use mini-batch SGD. First, we initialize at some $w_0$. At iteration $t$, we sample $S_t$ uniformly at random from $\{1, \ldots, n\}$, and then update via

$$w_{t+1} = w_t - \frac{\eta_t}{|S_t|}\sum_{i \in S_t} \nabla f_i(w_t), \tag{1}$$

where $S_t$ is a randomly selected subset of the $n$ data points and $\eta_t$ is the step size. To perform mini-batch SGD in a distributed manner, the global model $w_t$ is stored at a parameter server (PS) and updated according to (1), i.e., by using the mean of gradients that are evaluated at the compute nodes.

Let $p$ denote the total number of compute nodes. At each iteration $t$, during distributed mini-batch SGD, the PS broadcasts $w_t$ to each compute node. Each compute node $i \in \{1, \ldots, p\}$ is assigned $S_{i,t} \subseteq S_t$, and then evaluates the sum of gradients

$$g_i = \sum_{j \in S_{i,t}} \nabla f_j(w_t).$$

The PS then updates the global model via

$$w_{t+1} = w_t - \frac{\eta_t}{p}\sum_{i=1}^{p} g_i.$$

We note that in our setup we assume that the parameter server is the owner of the data, and has access to the entire data set of size $n$.
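The parameter-server update above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the per-sample loss $f_i(w) = \tfrac{1}{2}\|w - x_i\|^2$, the data, and the names `grad_f`, `node_gradient_sum`, and `ps_step` are all hypothetical choices made for the example.

```python
import random

def grad_f(w, x):
    """Gradient of the toy per-sample loss f_i(w) = 0.5 * ||w - x_i||^2 (an assumption)."""
    return [wk - xk for wk, xk in zip(w, x)]

def node_gradient_sum(w, data, indices):
    """Sum of per-sample gradients over the indices assigned to one node (g_i)."""
    total = [0.0] * len(w)
    for j in indices:
        total = [a + b for a, b in zip(total, grad_f(w, data[j]))]
    return total

def ps_step(w, data, p, samples_per_node, lr, rng):
    """One PS iteration: broadcast w, collect g_1..g_p, apply w <- w - (lr / p) * sum_i g_i."""
    node_grads = []
    for _ in range(p):
        idx = [rng.randrange(len(data)) for _ in range(samples_per_node)]
        node_grads.append(node_gradient_sum(w, data, idx))
    aggregate = [sum(col) for col in zip(*node_grads)]
    return [wk - lr / p * gk for wk, gk in zip(w, aggregate)]

rng = random.Random(0)
data = [[rng.gauss(3.0, 1.0), rng.gauss(-1.0, 1.0)] for _ in range(200)]
w = [0.0, 0.0]
for t in range(200):
    w = ps_step(w, data, p=4, samples_per_node=5, lr=0.02, rng=rng)
```

Because each $g_i$ is a sum rather than a mean, the effective step size in this sketch scales with the number of samples per node; for this toy loss the iterate settles near the data mean.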

Distributed training with Byzantine nodes. We assume that a fixed subset $Q$ of size $q$ of the $p$ compute nodes are Byzantine. Let $\hat g_i$ be the output of node $i$. If $i$ is not Byzantine ($i \notin Q$), we say it is “honest”, in which case its output is $\hat g_i = g_i$, where $g_i$ is the true sum of gradients assigned to node $i$. If $i$ is Byzantine ($i \in Q$), its output $\hat g_i$ can be any $d$-dimensional vector. The PS receives $\{\hat g_i\}_{i=1}^{p}$, and can then process these vectors to produce some approximation to the true gradient update in (1).

We make no assumptions on the Byzantine outputs. In particular, we allow adversaries with full information about $F$ and $w_t$, and we allow the Byzantine compute nodes to collude. Let $\epsilon = q/p$ be the fraction of Byzantine nodes. We will assume $\epsilon < 1/2$ throughout.
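To make the failure model concrete, the following sketch shows how a single Byzantine output can move a plain average arbitrarily far from the honest value. The vectors, the payload, and the helper names `average` and `corrupt` are hypothetical and chosen only for illustration.

```python
from statistics import mean

def average(vectors):
    """Coordinate-wise mean, i.e. the aggregation used in plain mini-batch SGD."""
    return [mean(col) for col in zip(*vectors)]

def corrupt(outputs, byzantine_ids, payload):
    """Replace the outputs of the Byzantine nodes with an arbitrary vector."""
    return [payload if i in byzantine_ids else g for i, g in enumerate(outputs)]

honest_outputs = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1], [1.0, -2.0], [1.0, -2.0]]
received = corrupt(honest_outputs, byzantine_ids={3}, payload=[-1e6, 1e6])

clean_avg = average(honest_outputs)  # close to [1.0, -2.0]
bad_avg = average(received)          # dominated by the single Byzantine payload
```

Since the payload is unconstrained, scaling it up scales the error of the average without bound, which is why averaging alone offers no Byzantine robustness.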

## 3 Detox: A Redundancy Framework to Filter most Byzantine Gradients

We now describe Detox, a framework for Byzantine-resilient mini-batch SGD with $p$ nodes, $q$ of which are Byzantine. Let $b$ be the desired batch-size, and let $r$ be an odd integer. We refer to $r$ as the redundancy ratio. For simplicity, we will assume $r$ divides $p$ and that $p$ divides $b$. Detox can be directly extended to the setting where this does not hold.

Detox first computes a random partition of $\{1, \ldots, p\}$ into $p/r$ node groups $A_1, \ldots, A_{p/r}$, each of size $r$. This partition will be fixed throughout. We then initialize at some $w_0$. For $t \ge 0$, we wish to compute some approximation to the gradient update in (1). To do so, we need a Byzantine-robust estimate of the true gradient. Fix $t$, and let us suppress the notation when possible. As in mini-batch SGD, let $S_t$ be a subset of $\{1, \ldots, n\}$ of size $b$, with each element sampled uniformly at random from $\{1, \ldots, n\}$. We then partition $S_t$ into $p/r$ groups $S_1, \ldots, S_{p/r}$, each of size $rb/p$. For each $i \in A_j$, the PS assigns node $i$ the task of computing

$$g_j := \frac{1}{|S_j|}\sum_{k \in S_j} \nabla f_k(w) = \frac{p}{rb}\sum_{k \in S_j} \nabla f_k(w). \tag{2}$$

If $i \in A_j$ is an honest node, then its output is $\hat g_i = g_j$, while if $i$ is Byzantine, it outputs some $d$-dimensional $\hat g_i$. The $\hat g_i$ are then sent to the PS. The PS then computes

$$z_j := \mathrm{maj}\left(\{\hat g_i \mid i \in A_j\}\right),$$

where $\mathrm{maj}$ denotes the majority vote. If there is no majority, we set $z_j = 0$. We will refer to $z_j$ as the “vote” of group $j$.
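The filtering step can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes honest nodes in a group return bit-identical vectors (so exact equality is a valid vote), and it falls back to the zero vector when no strict majority exists, which is an assumption of this sketch. The names `majority_vote` and `filter_votes` are hypothetical.

```python
from collections import Counter

def majority_vote(group_outputs):
    """Return the vector sent by a strict majority of the group, else the zero vector."""
    counts = Counter(tuple(g) for g in group_outputs)
    vector, votes = counts.most_common(1)[0]
    if 2 * votes > len(group_outputs):
        return list(vector)
    return [0.0] * len(group_outputs[0])  # assumed fallback when there is no majority

def filter_votes(outputs, node_groups):
    """Compute one vote per node group from the outputs received by the PS."""
    return [majority_vote([outputs[i] for i in group]) for group in node_groups]

# Six nodes, redundancy ratio 3: node 2 is Byzantine in the first group,
# nodes 4 and 5 collude in the second group and win its vote.
outputs = [[1.0, 2.0], [1.0, 2.0], [9.0, 9.0],
           [3.0, 4.0], [-5.0, -5.0], [-5.0, -5.0]]
votes = filter_votes(outputs, node_groups=[[0, 1, 2], [3, 4, 5]])
```

A Byzantine node only affects a vote when Byzantine nodes hold the majority of its group, which is the event the analysis in the next subsection bounds.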

Since some of these votes are still Byzantine, we must do some robust aggregation of the votes. We employ a hierarchical robust aggregation process Hier-Aggr, which uses two user-specified aggregation methods $\mathcal{A}_0$ and $\mathcal{A}_1$. First, the votes are partitioned into $k$ groups. Let $\hat z_1, \ldots, \hat z_k$ denote the output of $\mathcal{A}_0$ on each group. The PS then computes $\hat G = \mathcal{A}_1(\hat z_1, \ldots, \hat z_k)$ and updates the model via $w = w - \eta \hat G$. This hierarchical aggregation resembles a median of means approach on the votes [20], and has the benefit of improved robustness and efficiency. We discuss this in further detail in Section 4.
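A minimal sketch of the hierarchical aggregation, assuming the inner method is a plain mean and the outer method is a coordinate-wise median (one of several possible pairings). The contiguous grouping, the requirement that the number of groups divides the number of votes, and the names `hier_aggr`, `mean_aggregate`, and `coordinate_median` are assumptions made for this example.

```python
from statistics import mean, median

def mean_aggregate(vectors):
    """Inner aggregator: coordinate-wise mean of one vote group."""
    return [mean(col) for col in zip(*vectors)]

def coordinate_median(vectors):
    """Outer aggregator: coordinate-wise median of the group outputs."""
    return [median(col) for col in zip(*vectors)]

def hier_aggr(votes, k, inner=mean_aggregate, outer=coordinate_median):
    """Split votes into k contiguous groups, apply `inner` per group, then `outer`."""
    size = len(votes) // k  # assumes k divides the number of votes
    group_outputs = [inner(votes[g * size:(g + 1) * size]) for g in range(k)]
    return outer(group_outputs)

# Five honest votes near [1, 1] and one Byzantine vote.
votes = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [1.0, 1.0], [100.0, -100.0], [1.0, 1.0]]
estimate = hier_aggr(votes, k=3)
plain_mean = mean_aggregate(votes)
```

In this example the Byzantine vote contaminates only one of the three group means, so the outer median still returns a value close to the honest votes, while the plain mean over all votes does not.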

A description of Detox is given in Algorithm 1.

### 3.1 Filtering out Almost Every Byzantine Node

We now show that Detox filters out the vast majority of Byzantine gradients. Fix the iteration $t$. Recall that all honest nodes in a node group $A_j$ send $\hat g_i = g_j$ as in (2) to the PS. If $A_j$ has more honest nodes than Byzantine nodes, then $z_j = g_j$ and we say $z_j$ is honest. If not, then $z_j$ may not equal $g_j$, in which case $z_j$ is a Byzantine vote. Let $X_j$ be the indicator variable for whether block $A_j$ has more Byzantine nodes than honest nodes, and let $\hat q = \sum_j X_j$. This is the number of Byzantine votes. By filtering, Detox goes from a Byzantine compute node ratio of $\epsilon = q/p$ to a Byzantine vote ratio of $\hat\epsilon$, where $\hat\epsilon = \hat q\, r / p$.

We first show that $\mathbb{E}[\hat q]$ decreases exponentially with $r$, while the number of votes $p/r$ only decreases linearly with $r$. That is, by incurring a constant factor loss in compute resources, we gain an exponential improvement in the reduction of Byzantine nodes. Thus, even small $r$ can drastically reduce the Byzantine ratio of votes. This observation will allow us to instead use robust aggregation methods on the $z_j$, i.e., the votes, greatly improving our Byzantine robustness. We have the following theorem about $\hat q$. All proofs can be found in the appendix. Note that throughout, we did not focus on optimizing constants.

###### Theorem 1.

There is a universal constant $c$ such that if the fraction of Byzantine nodes is $\epsilon < c$, then the effective number of Byzantine votes after filtering becomes

$$\mathbb{E}[\hat q] = O\left(\epsilon^{(r-1)/2}\, q / r\right).$$

We now wish to use this to derive high probability bounds on $\hat q$. While the variables $X_j$ are not independent, they are negatively correlated. By using a version of Hoeffding’s inequality for weakly dependent variables, we can show that if the redundancy ratio $r$ is logarithmic in the number of nodes, then with high probability the number of effective Byzantine votes $\hat q$ drops to a constant.

###### Corollary 2.

There is a constant such that if and and then for any , with probability at least , we have that .

In the next section, we exploit this dramatic reduction of Byzantine votes to derive strong robustness guarantees for Detox.
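The exponential decay of the expected number of Byzantine votes can be checked numerically. The sketch below computes the expectation exactly from the hypergeometric distribution of Byzantine nodes in a single random group; the sizes (315 nodes, 15 of them Byzantine) and the function names are hypothetical choices for illustration.

```python
from math import comb

def prob_byzantine_vote(p, q, r):
    """P(a size-r group drawn without replacement from p nodes has a Byzantine majority)."""
    need = (r + 1) // 2  # smallest Byzantine count that wins the vote (r is odd)
    wins = sum(comb(q, m) * comb(p - q, r - m) for m in range(need, r + 1))
    return wins / comb(p, r)

def expected_byzantine_votes(p, q, r):
    """Expected number of Byzantine votes: (p / r) groups times the per-group probability."""
    return (p // r) * prob_byzantine_vote(p, q, r)

p, q = 315, 15  # hypothetical sizes; 315 is divisible by 1, 3, 5, 7 and 9
expected = {r: expected_byzantine_votes(p, q, r) for r in (1, 3, 5, 7, 9)}
for r, value in expected.items():
    print(f"r = {r}: expected Byzantine votes = {value:.4g}")
```

With $r = 1$ there is no filtering and the expectation equals the number of Byzantine nodes; each increase of $r$ by two then shrinks the expectation by roughly another factor proportional to the Byzantine fraction.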

## 4 Detox Improves the Speed and Robustness of Robust Estimators

Using the results of the previous section, if we set the redundancy ratio $r$ to be logarithmic in the number of nodes, the filtering stage of Detox reduces the number of Byzantine votes to roughly a constant. While we could apply some robust aggregator directly to the output votes of the filtering stage, such methods often scale poorly with the number of votes $p/r$. By instead applying Hier-Aggr, we greatly improve efficiency and robustness. Recall that in Hier-Aggr, we partition the votes into $k$ “vote groups”, apply some aggregator $\mathcal{A}_0$ to each group, and apply some aggregator $\mathcal{A}_1$ to the outputs of $\mathcal{A}_0$. We analyze the case where $k$ is roughly constant, $\mathcal{A}_0$ computes the mean of its inputs, and $\mathcal{A}_1$ is a robust aggregator. In this case, Hier-Aggr is analogous to the Median of Means (MoM) method from robust statistics [20].

#### Improved speed.

Suppose that without redundancy, the time required for the compute nodes to finish is $T$. Applying Krum [5], Multi-Krum [10], and Bulyan [19] to their $p$ outputs requires $O(p^2 d)$ operations, so their overall runtime is $O(T + p^2 d)$. In Detox, the compute nodes require $r$ times more computation to evaluate redundant gradients, which takes $O(rT)$ time; for logarithmic $r$ this is only a logarithmic overhead. With Hier-Aggr as above, Detox performs three major operations: (1) majority voting, (2) mean computation of the $k$ vote groups, and (3) robust aggregation of these $k$ means using $\mathcal{A}_1$. (1) and (2) require $O(pd)$ time. For practical $\mathcal{A}_1$ aggregators, including Multi-Krum and Bulyan, (3) requires $O(k^2 d)$ time. Since $k \ll p$, Detox has runtime $O(rT + pd)$. If $T = O(d)$ (which generally holds for gradient computations), Krum, Multi-Krum, and Bulyan require $O(p^2 d)$ time, but Detox only requires $O(rd + pd)$ time. Thus, Detox can lead to significant speedups, especially when the number of workers is large.
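The runtime comparison can be illustrated with a rough operation-count model. This is only a back-of-the-envelope sketch: it assumes a robust aggregator whose cost is quadratic in the number of inputs and linear in the dimension, voting and averaging costs that are linear in the number of nodes, and it drops all constants. The function names and the numbers plugged in are hypothetical.

```python
def vanilla_cost(T, p, d):
    """Node compute time T plus a quadratic-in-p robust aggregation over p gradients."""
    return T + p * p * d

def detox_cost(T, p, d, r, k):
    """r-fold redundant compute, linear-time voting and averaging, then a
    quadratic-cost robust aggregation over only k group means."""
    return r * T + p * d + k * k * d

# Hypothetical setting: 1000 nodes, a model with 10^6 parameters, T on the order of d.
T, p, d, r, k = 10**6, 1000, 10**6, 11, 10
speedup = vanilla_cost(T, p, d) / detox_cost(T, p, d, r, k)
```

Under these assumptions the aggregation term dominates the vanilla cost, so the ratio grows roughly linearly with the number of nodes.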

#### Improved robustness.

To analyze robustness, we first need some distributional assumptions. At any given iteration, let $G$ denote the full gradient of $F(w)$. Throughout this section, we assume that the gradient of each sample is drawn from a distribution $\mathcal{D}$ on $\mathbb{R}^d$ with mean $G$ and variance $\sigma^2$. In Detox, the “honest” votes $z_j$ will also have mean $G$, but their variance will be $\sigma^2 p/(rb)$. This is because each honest compute node gets a sample of size $rb/p$, so its variance is reduced by a factor of $rb/p$.

Suppose $\hat G$ is some approximation to the true gradient $G$. We say that $\hat G$ is a $\Delta$-inexact gradient oracle for $G$ if $\|\hat G - G\| \le \Delta$. [27] shows that access to a $\Delta$-inexact gradient oracle is sufficient to upper bound the error of a model produced by performing gradient updates with $\hat G$. To bound the robustness of an aggregator, it suffices to bound $\Delta$. Under the distributional assumptions above, we will derive bounds on $\Delta$ for the hierarchical aggregator with different base aggregators $\mathcal{A}_1$.

We will analyze Detox when computes the mean of the vote groups, and is geometric median, coordinate-wise median, or -trimmed mean [26]. We will denote the approximation to computed by Detox in these three instances by and , respectively. Using the proof techniques in [20], we get the following.

###### Theorem 3.

Assume and where is the constant from Corollary 2. There are constants such that for all , with probability at least :

1. If , then is a -inexact gradient oracle.

2. If , then is a -inexact gradient oracle.

3. If and , then is a -inexact gradient oracle.

The above theorem has three important implications. First, we can derive robustness guarantees for Detox that are virtually independent of the Byzantine ratio . Second, even when there are no Byzantine machines, it is known that no aggregator can achieve  [18], and because we achieve , we cannot expect to get an order of better robustness by any other aggregator. Third, other than a logarithmic dependence on , there is no dependence on the number of nodes . Even as and increase, we still maintain roughly the same robustness guarantees.

By comparison, the robustness guarantees of Krum and Geometric Median applied directly to the compute nodes worsens as as increases [4, 23]. Similarly, [26] show if we apply coordinate-wise median to nodes, each of which are assigned gradients, we get a -inexact gradient oracle where . If is constant and is comparable to , then this is roughly , whereas Detox can produce a -inexact gradient oracle for . Thus, the robustness of Detox can scale much better with the number of nodes than naive robust aggregation of gradients.

## 5 Experiments

In this section we present an experimental study on pairing Detox with a set of previously proposed robust aggregation methods, including Multi-krum [4], Bulyan [19], and coordinate-wise median [27]. We also incorporate Detox with a recently proposed Byzantine-resilient distributed training method, signSGD with majority vote [3]. We conduct extensive experiments on the scalability and robustness of these Byzantine-resilient methods, and the improvements gained when pairing them with Detox. All our experiments are deployed on real distributed clusters under various Byzantine attack models. Our implementation is publicly available for reproducibility.

The main findings are as follows: 1) Applying Detox leads to significant speedups, e.g., up to an order of magnitude end-to-end training speedup is observed; 2) in defending against state-of-the-art Byzantine attacks, Detox leads to significant Byzantine-resilience, e.g., applying Bulyan on top of Detox improves the test-set prediction accuracy from 11% to 60% when training VGG13-BN on CIFAR-100 under the “a little is enough” (ALIE) [2] Byzantine attack. Moreover, pairing signSGD with Detox significantly improves the test-set prediction accuracy when defending against a constant Byzantine attack for ResNet-18 trained on CIFAR-10.

### 5.1 Experimental Setup

We implemented vanilla versions of the aforementioned Byzantine-resilient methods, as well as versions of these methods paired with Detox, in PyTorch [21] with MPI4py [9]. Our experimental comparisons are deployed on a cluster of m5.2xlarge instances on Amazon EC2, where 1 node serves as the PS and the remaining nodes are compute nodes. In all following experiments, we set the number of Byzantine nodes to be .

In each iteration of the vanilla Byzantine resilient methods, each compute node evaluates gradients sampled from its partition of data while in Detox each compute node evaluates times more gradients where , so . The average of these locally computed gradients is then sent back to the PS. After receiving all gradient summations from the compute nodes, the PS applies either vanilla Byzantine resilient methods or their Detox paired variants.

### 5.2 Implementation of Detox

We emphasize that Detox is not simply a new robust aggregation technique. It is instead a general Byzantine-resilient distributed training framework, and any robust aggregation method can be immediately implemented on top of it to increase its Byzantine-resilience and scalability. Note that after the majority voting stage on the PS one has a wide range of choices for $\mathcal{A}_0$ and $\mathcal{A}_1$. In our implementations, we had the following setups: 1) $\mathcal{A}_0 =$ Mean, $\mathcal{A}_1 =$ Coordinate-wise Median, 2) $\mathcal{A}_0 =$ Multi-krum, $\mathcal{A}_1 =$ Mean, 3) $\mathcal{A}_0 =$ Bulyan, $\mathcal{A}_1 =$ Mean, and 4) $\mathcal{A}_0 =$ coordinate-wise majority vote, $\mathcal{A}_1 =$ coordinate-wise majority vote (designed specifically for pairing Detox with signSGD). We tried $\mathcal{A}_0 =$ Mean and $\mathcal{A}_1 =$ Multi-krum/Bulyan, but we found that setups 2) and 3) had better resilience than these choices. More details on the implementation and system-level optimizations that we performed can be found in Appendix B.1.

#### Byzantine attack models

We consider two Byzantine attack models for pairing Multi-krum, Bulyan, and coordinate-wise median with Detox. First, we consider the “reversed gradient” attack, where adversarial nodes that were supposed to send $g$ to the PS instead send $-cg$, for some $c > 0$.

The second Byzantine attack model we study is the recently proposed ALIE [2] attack, where the Byzantine compute nodes collude and use their locally calculated gradients to estimate the mean and standard deviation of the entire set of gradients among all other compute nodes. The Byzantine nodes then use the estimated mean and standard deviation to manipulate the gradient they send back to the PS. To be more specific, Byzantine nodes will send $\hat\mu + z \cdot \hat\sigma$, where $\hat\mu$ and $\hat\sigma$ are the mean and standard deviation estimated by the Byzantine nodes and $z$ is a hyper-parameter which was tuned empirically in [2].

Then, to compare the resilience of vanilla signSGD and the one paired with Detox, we consider a simple attack, the constant Byzantine attack. In the constant Byzantine attack, Byzantine compute nodes simply send a constant gradient matrix with dimension equal to that of the true gradient where all elements are equal to . Under this attack, and specifically for signSGD, the Byzantine gradients will mislead model updates towards wrong directions and corrupt the final model trained via signSGD.
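The three attack models can be sketched as small payload generators. These are hedged illustrations only: the scaling `c`, the ALIE multiplier `z` and its sign convention, the use of a population standard deviation, and the constant `value` are assumptions of this sketch, as are the function names.

```python
from statistics import mean, pstdev

def reversed_gradient(true_grad, c=1.0):
    """Reversed gradient attack: send -c * g instead of g, with c > 0."""
    return [-c * x for x in true_grad]

def alie(observed_grads, z=1.0):
    """ALIE-style payload: per-coordinate estimated mean plus z estimated standard
    deviations, computed from the gradients the colluding nodes can observe."""
    columns = list(zip(*observed_grads))
    return [mean(col) + z * pstdev(col) for col in columns]

def constant_attack(dim, value):
    """Constant attack: every coordinate of the payload equals `value`."""
    return [value] * dim

observed = [[1.0, 0.0], [3.0, 0.0]]
payload_reversed = reversed_gradient([1.0, -2.0], c=2.0)
payload_alie = alie(observed, z=1.0)
payload_constant = constant_attack(3, value=-1.0)
```

The ALIE payload stays within the spread of the honest gradients, which is what makes it hard for distance-based aggregators to reject, whereas the reversed gradient and constant payloads are typically easier to detect.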

#### Datasets and models

We conducted our experiments over ResNet-18 [15] on CIFAR-10 and VGG13-BN [22] on CIFAR-100. For each dataset, we use data augmentation (random crops, and flips) and normalize each individual image. Moreover, we tune the learning rate scheduling process and use the constant momentum at in running all experiments. The details of parameter tuning and dataset normalization are reported in the Appendix B.2.

### 5.3 Results

#### Scalability

We report a per-iteration runtime analysis of the aforementioned robust aggregations and their Detox paired variants on both CIFAR-10 over ResNet-18 and CIFAR-100 over VGG-13. The results on ResNet-18 and VGG13-BN are shown in Figure 2 and 3 respectively.

We observe that although Detox requires slightly more compute time per iteration, due to its algorithmic redundancy, it largely reduces the PS computation cost during the aggregation stage, which matches our theoretical analysis. Surprisingly, we observe that by applying Detox, the communication costs decrease. This is because the variance of computation time among compute nodes increases with heavier computational redundancy. Therefore, after applying Detox, compute nodes tend not to send their gradients to the PS at the same time, which mitigates potential network bandwidth congestion. In a nutshell, applying Detox can lead to up to a 3× per-iteration speedup.

#### Byzantine-resilience under various attacks

We first study the Byzantine-resilience of all methods and baselines under the ALIE attack, which is, to the best of our knowledge, the strongest Byzantine attack known. The results on ResNet-18 and VGG13-BN are shown in Figure 2 and 3 respectively. Applying Detox leads to significant improvement in Byzantine-resilience compared to vanilla Multi-krum, Bulyan, and coordinate-wise median on both datasets, as shown in Table 1.

We then consider the reverse gradient attack; the results are shown in Figure 4. Since reverse gradient is a much weaker attack, all vanilla robust aggregation methods and their Detox paired variants defend well.

Moreover, applying Detox leads to significant end-to-end speedups. In particular, combining the coordinate-wise median with Detox led to a speedup in the time needed to reach 90% test-set prediction accuracy for ResNet-18 trained on CIFAR-10. The speedup results are shown in Figure 5. For the experiment where VGG13-BN was trained on CIFAR-100, up to an order of magnitude end-to-end speedup can be observed for coordinate-wise median applied on top of Detox.

For completeness, we also compare versions of Detox with Draco [7]. This is not the focus of this work, as we are primarily interested in showing that Detox improves the robustness of traditional robust aggregators. However the comparisons with Draco can be found in the Appendix B.4.

#### Comparison between Detox and signSGD

The results of both ResNet-18 trained on CIFAR-10 and VGG13-BN trained on CIFAR-100 are shown in Figure 6 where we observe that Detox paired signSGD improves the Byzantine resilience of signSGD significantly. For ResNet-18 trained on CIFAR-10, Detox improves testset prediction accuracy of vanilla signSGD from to . While for VGG13-BN trained on CIFAR-100, Detox improves testset prediction accuracy (TOP-1) of vanilla signSGD from to .

#### Mean estimation on synthetic data

To verify our theoretical analysis, we finally conduct an experiment on a simple mean estimation task. The results of our synthetic mean experiment are shown in Figure 7. In the synthetic mean experiment, we set , and for dimension , we generate samples iid from . The Byzantine nodes instead send a constant vector of the same dimension with norm 100. The robustness of an estimator is reflected in the norm of its mean estimate. Our experimental results show that Detox increases the robustness of geometric median and coordinate-wise median, and decreases the dependence of the error on .
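A small stand-in for this kind of synthetic experiment can be written with the standard library alone. The sketch below is not the authors' setup: the node counts, dimension, standard normal samples with true mean zero, redundancy ratio, number of vote groups, and all function names are assumptions chosen for illustration. It compares a plain mean, a coordinate-wise median over all nodes, and a Detox-style pipeline (majority vote, group means, coordinate-wise median).

```python
import random
from collections import Counter
from statistics import mean, median

def norm(v):
    return sum(x * x for x in v) ** 0.5

def coord_mean(vectors):
    return [mean(col) for col in zip(*vectors)]

def coord_median(vectors):
    return [median(col) for col in zip(*vectors)]

def majority(vectors):
    vec, count = Counter(map(tuple, vectors)).most_common(1)[0]
    return list(vec) if 2 * count > len(vectors) else [0.0] * len(vectors[0])

def run_trial(p=45, q=5, r=3, k=5, d=10, seed=0):
    """Return the error norm of three estimators of a zero mean (all sizes hypothetical)."""
    rng = random.Random(seed)
    attack = [100.0 / d ** 0.5] * d  # constant vector with norm 100
    byzantine = set(rng.sample(range(p), q))

    def draw():
        return [rng.gauss(0.0, 1.0) for _ in range(d)]

    # Without redundancy: every node reports one sample, or the attack vector.
    vanilla = [attack if i in byzantine else draw() for i in range(p)]

    # Detox-style: nodes in a group share r samples and report their mean.
    nodes = list(range(p))
    rng.shuffle(nodes)
    votes = []
    for g in range(p // r):
        group = nodes[g * r:(g + 1) * r]
        shared = coord_mean([draw() for _ in range(r)])
        votes.append(majority([attack if i in byzantine else shared for i in group]))

    size = len(votes) // k  # assumes k divides the number of votes
    group_means = [coord_mean(votes[j * size:(j + 1) * size]) for j in range(k)]
    return {
        "mean": norm(coord_mean(vanilla)),
        "coordinate_median": norm(coord_median(vanilla)),
        "detox_median_of_means": norm(coord_median(group_means)),
    }

results = run_trial()
```

Since the true mean is zero, each returned value is the estimation error of the corresponding aggregator; with five Byzantine nodes and groups of three, at most two votes can be Byzantine, so the outer median over five group means always has an honest majority.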

## 6 Conclusion

In this paper, we present Detox, a new framework for Byzantine-resilient distributed training. Notably, any robust aggregator can be immediately used with Detox to increase its robustness and efficiency. We demonstrate these improvements theoretically and empirically. In the future, we would like to devise a privacy-preserving version of Detox, as currently it requires the PS to be the owner of the data, and also to partition data among compute nodes. This means that the current version of Detox is not privacy preserving. Overcoming this limitation would allow us to develop variants of Detox for federated learning.

## References

• [1] D. Alistarh, Z. Allen-Zhu and J. Li (2018) Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 4618–4628. External Links: Link Cited by: §1.
• [2] M. Baruch, G. Baruch and Y. Goldberg (2019) A little is enough: circumventing defenses for distributed learning. arXiv preprint arXiv:1902.06156. Cited by: Figure 2, §5.2, Table 1, §5.
• [3] J. Bernstein, J. Zhao, K. Azizzadenesheli and A. Anandkumar (2018) SignSGD with majority vote is communication efficient and fault tolerant. arXiv. Cited by: §1, §5.3, §5.
• [4] P. Blanchard, R. Guerraoui and J. Stainer (2017) Machine learning with adversaries: byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pp. 119–129. Cited by: §1, §4, §5.
• [5] P. Blanchard, E. M. E. Mhamdi, R. Guerraoui and J. Stainer (2017) Machine learning with adversaries: byzantine tolerant gradient descent. See DBLP:conf/nips/2017, pp. 118–128. External Links: Link Cited by: 2nd item, §1, §1, §1, §4.
• [6] J. R. C. Pelekis (2017) Hoeffding’s inequality for sums of weakly dependent random variables. Mediterranean Journal of Mathematics. Cited by: §A.3.
• [7] L. Chen, H. Wang, Z. Charles and D. Papailiopoulos (2018) DRACO: byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, pp. 902–911. Cited by: §1, §1, §1, §5.3.
• [8] Y. Chen, L. Su and J. Xu (2017) Distributed statistical machine learning in adversarial settings: byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems 1 (2), pp. 44. Cited by: §1, §1, §1.
• [9] L. D. Dalcin, R. R. Paz, P. A. Kler and A. Cosimo (2011) Parallel distributed computing using python. Advances in Water Resources 34 (9), pp. 1124–1139. Cited by: §5.1.
• [10] G. Damaskinos, E. M. El Mhamdi, R. Guerraoui and S. Guirguis (2019) AGGREGATHOR: byzantine machine learning via robust gradient aggregation. Conference on Systems and Machine Learning. Cited by: §1, §1, §1, §4.
• [11] G. Damaskinos, E. M. E. Mhamdi, R. Guerraoui, A. Guirguis and S. Rouault (2019) AggregaThor: byzantine machine learning via robust gradient aggregation. In SysML, Cited by: §B.1.
• [12] D. Data, L. Song and S. Diggavi (2018) Data encoding for byzantine-resilient distributed gradient descent. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 863–870. Cited by: §1.
• [13] E. El-Mhamdi, R. Guerraoui, A. Guirguis and S. Rouault (2019) SGD: decentralized byzantine resilience. arXiv preprint arXiv:1905.03853. Cited by: §1.
• [14] E. El-Mhamdi and R. Guerraoui (2019) Fast and secure distributed learning in high dimension. arXiv preprint arXiv:1905.04374. Cited by: §1.
• [15] K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.2.
• [16] L. Lamport, R. Shostak and M. Pease (1982) The byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS) 4 (3), pp. 382–401. Cited by: §1.
• [17] N. Linial and Z. Luria (2014) Chernoff’s inequality-a very elementary proof. arXiv preprint arXiv:1403.7739. Cited by: §A.3, Theorem.
• [18] G. Lugosi and S. Mendelson (2019) Sub-gaussian estimators of the mean of a random vector. The Annals of Statistics 47 (2), pp. 783–794. Cited by: §4.
• [19] E. M. E. Mhamdi, R. Guerraoui and S. Rouault (2018) The hidden vulnerability of distributed learning in byzantium. arXiv preprint arXiv:1802.07927. Cited by: 1st item, §1, §1, §1, §1, §4, §5.
• [20] S. Minsker (2015) Geometric median and robust estimation in banach spaces. Bernoulli 21 (4), pp. 2308–2335. Cited by: §A.4, §A.4, §3, §4, §4, Lemma 6.
• [21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §5.1.
• [22] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.2.
• [23] C. Xie, O. Koyejo and I. Gupta (2018) Generalized byzantine-tolerant sgd. arXiv preprint arXiv:1802.10116. Cited by: §1, §4.
• [24] C. Xie, O. Koyejo and I. Gupta (2018) Zeno: byzantine-suspicious stochastic gradient descent. arXiv preprint arXiv:1805.10032. Cited by: §1.
• [25] C. Xie, S. Koyejo and I. Gupta (2019) Fall of empires: breaking byzantine-tolerant sgd by inner product manipulation. arXiv preprint arXiv:1903.03936. Cited by: §1.
• [26] D. Yin, Y. Chen, K. Ramchandran and P. Bartlett (2018) Byzantine-robust distributed learning: towards optimal statistical rates. In International Conference on Machine Learning, pp. 5636–5645. Cited by: §1, §1, §4, §4.
• [27] D. Yin, Y. Chen, K. Ramchandran and P. Bartlett (2018) Defending against saddle point attack in byzantine-robust distributed learning. CoRR abs/1806.05358. Cited by: §1, §4, §5.
• [28] Q. Yu, N. Raviv, J. So and A. S. Avestimehr (2018) Lagrange coded computing: optimal design for resiliency, security and privacy. arXiv preprint arXiv:1806.00939. Cited by: §1.

## Appendix A Proofs

### A.1 Proof of Theorem 1

The following is a more precise statement of the theorem.

###### Theorem.

If , and then falls as which is exponential in r.

###### Proof.

By direct computation,

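$$
\begin{aligned}
\mathbb{E}(\hat{q}) &= \mathbb{E}\left(\sum_{i=1}^{p/r} X_i\right) = \frac{p}{r}\,\mathbb{E}(X_i) = \frac{p}{r}\sum_{i=0}^{(r-1)/2} \frac{\binom{q}{r-i}\binom{p-q}{i}}{\binom{p}{r}} \\
&\le \frac{p}{r}\cdot\frac{r+1}{2}\cdot\frac{\binom{q}{(r+1)/2}\binom{p-q}{(r-1)/2}}{\binom{p}{r}} \\
&\le \frac{p}{r}\cdot\frac{r+1}{2}\binom{r}{(r-1)/2}\frac{q^{(r+1)/2}(p-q)^{(r-1)/2}}{(p-r)^{r}} \\
&= \frac{p}{r}\cdot\frac{r+1}{2}\binom{r}{(r-1)/2}\frac{q^{(r+1)/2}(p-q)^{(r-1)/2}}{p^{r}(1-r/p)^{r}} \\
&\le \frac{p}{r}\cdot\frac{r+1}{2}\binom{r}{(r-1)/2}\frac{q^{(r+1)/2}(p-q)^{(r-1)/2}}{p^{r}(1/2)^{r}} \\
&= \frac{p}{r}(r+1)\,2^{r-1}\binom{r}{(r-1)/2}\epsilon^{(r+1)/2}(1-\epsilon)^{(r-1)/2}.
\end{aligned}
$$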

Note that $\binom{r}{(r-1)/2}$ is the coefficient of $x^{(r-1)/2}$ in the binomial expansion of $(1+x)^r$. Therefore, setting $x = 1$, we find that $\binom{r}{(r-1)/2} \le 2^r$. Therefore,

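$$
\begin{aligned}
\frac{p}{r}(r+1)\,2^{r-1}\binom{r}{(r-1)/2}\epsilon^{(r+1)/2}(1-\epsilon)^{(r-1)/2}
&\le \frac{p}{r}(r+1)\,2^{2r-1}\epsilon^{(r+1)/2}(1-\epsilon)^{(r-1)/2} \\
&= \frac{p}{r}(r+1)\,\epsilon\left(2^{2r-1}\epsilon^{(r-1)/2}(1-\epsilon)^{(r-1)/2}\right) \\
&= 2\,\frac{q}{r}(r+1)\left(16\,\epsilon(1-\epsilon)\right)^{(r-1)/2}.
\end{aligned}
$$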

Note that since $r > 3$ and $r$ is odd, we have $(r+1)\,16^{(r-1)/2} \le 40^{(r-1)/2}$. Therefore,

$$
\mathbb{E}(\hat{q}) \le 2q\left(40\,\epsilon(1-\epsilon)\right)^{(r-1)/2}/r.
$$

For $r = 3$, we have the following lemma.

If , then when .

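###### Proof.

$$
\begin{aligned}
\mathbb{E}(\hat{q}) &= \mathbb{E}\left(\sum_{i=1}^{p/3} X_i\right) = \frac{p}{3}\,\mathbb{E}(X_i) = \frac{p}{3}\cdot\frac{\binom{q}{3}+\binom{q}{2}\binom{p-q}{1}}{\binom{p}{3}} \\
&= \frac{p}{3}\cdot\frac{q(q-1)(3p-2q-2)}{p(p-1)(p-2)} = \frac{q}{3}\cdot\frac{\left(\epsilon-\frac{1}{p}\right)\left(3-2\epsilon-\frac{2}{p}\right)}{\left(1-\frac{1}{p}\right)\left(1-\frac{2}{p}\right)} \\
&\le \frac{q}{3}\,\epsilon\,\frac{3-2\epsilon-\frac{2}{p}}{1-\frac{2}{p}} \le \frac{q\,\epsilon\,(4-2\epsilon)}{3}.
\end{aligned}
$$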
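As a numerical sanity check on the two bounds in this section, the following minimal Python sketch evaluates the exact expectation from the first line of each derivation and compares it with the corresponding closed-form bound. The function names and the example values of `p`, `q` and `r` are illustrative assumptions and are not taken from the experiments.

```python
from math import comb


def expected_byzantine_groups(p: int, q: int, r: int) -> float:
    """Exact E[q_hat] for p nodes, q Byzantine nodes and p/r groups of odd size r.

    A group is counted when at most (r-1)/2 of its r members are honest,
    i.e. when the Byzantine nodes hold the majority in that group.
    """
    if r % 2 != 1 or p % r != 0:
        raise ValueError("r must be odd and must divide p")
    favorable = sum(comb(q, r - i) * comb(p - q, i) for i in range((r - 1) // 2 + 1))
    return (p / r) * favorable / comb(p, r)


def bound_general(p: int, q: int, r: int) -> float:
    """Bound 2q(40 eps (1 - eps))^((r-1)/2) / r from the end of the general proof."""
    eps = q / p
    return 2 * q * (40 * eps * (1 - eps)) ** ((r - 1) / 2) / r


def closed_form_r3(p: int, q: int) -> float:
    """Closed form (p/3) q(q-1)(3p-2q-2) / (p(p-1)(p-2)) for r = 3."""
    return (p / 3) * q * (q - 1) * (3 * p - 2 * q - 2) / (p * (p - 1) * (p - 2))


def bound_r3(p: int, q: int) -> float:
    """Bound q eps (4 - 2 eps) / 3 for r = 3."""
    eps = q / p
    return q * eps * (4 - 2 * eps) / 3


if __name__ == "__main__":
    # Hypothetical settings, chosen only so that eps = q/p is small.
    print(expected_byzantine_groups(1000, 20, 5), bound_general(1000, 20, 5))
    print(expected_byzantine_groups(900, 18, 3), bound_r3(900, 18))
```

In both settings the exact expectation is well below the bound, which is expected since each derivation uses several loose inequalities.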

### A.2 Proof of Corollary 2

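From Theorem 1 we see that $\mathbb{E}[\hat{q}] \le 2q\left(40\,\epsilon(1-\epsilon)\right)^{(r-1)/2}/r$. Now, straightforward analysis implies that if and then $\mathbb{E}[\hat{q}] \le 1$. We will then use the following lemma: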

###### Lemma 5.

For all ,

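$$
\mathbb{P}\left[\hat{q} \ge \mathbb{E}[\hat{q}](1+\theta)\right] \le \left(\frac{1}{1+\theta/2}\right)^{\mathbb{E}[\hat{q}]\theta/2}
$$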

Now, using Lemma 5 and assuming $\theta \ge 2$,

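$$
\begin{aligned}
& \mathbb{P}\left[\hat{q} \ge \mathbb{E}[\hat{q}](1+\theta)\right] \le \left(\frac{1}{1+\theta/2}\right)^{\mathbb{E}[\hat{q}]\theta/2} \\
\implies{} & \mathbb{P}\left[\hat{q} \ge 1+\mathbb{E}[\hat{q}]\theta\right] \le \left(\frac{1}{1+\theta/2}\right)^{\mathbb{E}[\hat{q}]\theta/2} \\
\implies{} & \mathbb{P}\left[\hat{q} \ge 1+\mathbb{E}[\hat{q}]\theta\right] \le 2^{-\mathbb{E}[\hat{q}]\theta/2}
\end{aligned}
$$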

where we used the fact that $\mathbb{E}[\hat{q}] \le 1$ in the first implication and the assumption that $\theta \ge 2$ in the second. Setting , we get the probability bound. Finally, setting makes , which completes the proof.
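To make the last relaxation concrete, the short Python sketch below evaluates the right-hand side of Lemma 5 next to the simpler bound $2^{-\mathbb{E}[\hat{q}]\theta/2}$. The second implication replaces $1/(1+\theta/2)$ by $1/2$, which is only valid for $\theta \ge 2$. The function names and the sample values are assumptions made for illustration.

```python
def lemma5_tail_bound(mean_q_hat: float, theta: float) -> float:
    """Right-hand side of Lemma 5: (1 / (1 + theta/2)) ** (E[q_hat] * theta / 2)."""
    return (1.0 / (1.0 + theta / 2.0)) ** (mean_q_hat * theta / 2.0)


def relaxed_tail_bound(mean_q_hat: float, theta: float) -> float:
    """Relaxed bound 2 ** (-E[q_hat] * theta / 2); it dominates Lemma 5 when theta >= 2."""
    return 2.0 ** (-mean_q_hat * theta / 2.0)


if __name__ == "__main__":
    # Hypothetical value E[q_hat] = 0.5; the two bounds coincide at theta = 2.
    for theta in (2.0, 4.0, 8.0):
        print(theta, lemma5_tail_bound(0.5, theta), relaxed_tail_bound(0.5, theta))
```

For larger $\theta$ the relaxed bound becomes increasingly loose relative to Lemma 5, but it is simpler to invert when solving for a target failure probability.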

### A.3 Proof of Lemma 5

We will prove the following:

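$$
\mathbb{P}\left[\hat{q} \ge \mathbb{E}[\hat{q}](1+\theta)\right] \le \left(\frac{1}{1+\frac{\theta}{2}}\right)^{\mathbb{E}[\hat{q}]\theta/2}
$$

###### Proof.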

We will use the following theorem for this proof [17, 6].

###### Theorem (Linial [17]).

Let $X_1, \dots, X_{\hat{p}}$ be Bernoulli random variables. Let $\beta$ be such that $\beta\hat{p}$ is a positive integer and let $k$ be any positive integer such that $k < \beta\hat{p}$. Then

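$$
\mathbb{P}\left[\sum_{i=1}^{\hat{p}} X_i \ge \beta\hat{p}\right] \le \frac{1}{\binom{\beta\hat{p}}{k}} \sum_{|A|=k} \mathbb{P}\left[\wedge_{i\in A}(X_i=1)\right]
$$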

Let . Now, . We will show that

$$
\mathbb{P}\left[\wedge_{i\in A}(X_i=1)\right] \le \left(\mathbb{E}[\hat{q}]/\hat{p}\right)^{k}
$$

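where $A \subseteq \{1, \dots, \hat{p}\}$ is of size $k$. To see this, note that for any $i$, $\mathbb{P}[X_i = 1] = \mathbb{E}[\hat{q}]/\hat{p}$. The conditional probability of some other $X_j$ being $1$ given that $X_i$ is $1$ would only reduce. Formally, for $i \neq j$,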

 P[Xj=1|Xi=1]≤P[Xi=1]=ϵγ.

Note that for $X_i$ to be $1$, the Byzantine machines in the $i$-th block must be in the majority. Hence, the pool of leftover Byzantine machines shrank by more than the pool of honest machines. Since the total number of Byzantine machines is less than the number of honest machines, the probability of them being in the majority in block $j$ reduces. Therefore,

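$$
\begin{aligned}
\mathbb{P}\left[\sum_{i=1}^{\hat{p}} X_i \ge \mathbb{E}[\hat{q}](1+\theta)\right]
&\le \frac{\binom{\hat{p}}{k}}{\binom{\mathbb{E}[\hat{q}](1+\theta)}{k}}\,\mathbb{P}\left[\wedge_{i\in A}(X_i=1)\right] \\
&\le \frac{\binom{\hat{p}}{k}}{\binom{\mathbb{E}[\hat{q}](1+\theta)}{k}}\left(\mathbb{E}[\hat{q}]/\hat{p}\right)^{k} \\
&\le \frac{\hat{p}^{k}}{k!\,\binom{\mathbb{E}[\hat{q}](1+\theta)}{k}}\left(\frac{\mathbb{E}[\hat{q}]}{\hat{p}}\right)^{k}.
\end{aligned}
$$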

Letting $k = \mathbb{E}[\hat{q}]\theta/2$, we then have

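$$
\mathbb{P}\left[\sum_{i=1}^{\hat{p}} X_i \ge \mathbb{E}[\hat{q}](1+\theta)\right]
\le \frac{\hat{p}^{k}}{\left(\mathbb{E}[\hat{q}](1+\theta/2)\right)^{k}}\left(\mathbb{E}[\hat{q}]/\hat{p}\right)^{k}
= \left(\frac{1}{1+\frac{\theta}{2}}\right)^{\mathbb{E}[\hat{q}]\theta/2}
$$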

### A.4 Proof of Theorem 3

We will adapt the techniques of Theorem 3.1 in [20].

###### Lemma 6 ([20], Lemma 2).

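Let $\mathbb{H}$ be some Hilbert space, and for $x_1, \dots, x_k \in \mathbb{H}$, let $x_{gm}$ be their geometric median. Fix $\alpha \in (0, 1/2)$ and suppose that $z \in \mathbb{H}$ satisfies $\|x_{gm} - z\| > C_\alpha r$, where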

$$
C_\alpha = (1-\alpha)\sqrt{\frac{1}{1-2\alpha}}
$$

and $r > 0$. Then there exists $J \subseteq \{1, \dots, k\}$ with $|J| > \alpha k$ such that for all $j \in J$, $\|x_j - z\| > r$.

Note that for a general Hilbert or Banach space $\mathbb{H}$, the geometric median is defined as:

$$
x_{gm} := \operatorname*{arg\,min}_{x \in \mathbb{H}} \sum_{j=1}^{k} \|x - x_j\|_{\mathbb{H}}
$$

where $\|\cdot\|_{\mathbb{H}}$ is the norm on $\mathbb{H}$. This coincides with the notion of geometric median in $\mathbb{R}^d$ under the $\ell_2$ norm. Note that the coordinate-wise median is the geometric median in the real space $\mathbb{R}^d$ with the $\ell_1$ norm, which forms a Banach space.
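The two medians discussed above can be sketched in a few lines of Python. The geometric median has no closed form, so this sketch approximates it with Weiszfeld's iteration, a standard scheme that is an assumption here and is not prescribed by the proof. The function names, the iteration count and the example points are likewise illustrative.

```python
import math
from statistics import median


def geometric_median(points, iters=200, eps=1e-12):
    """Approximate argmin_x sum_j ||x - x_j||_2 with Weiszfeld's iteration."""
    dim = len(points[0])
    # Start from the coordinate-wise mean of the points.
    x = [sum(pt[c] for pt in points) / len(points) for c in range(dim)]
    for _ in range(iters):
        # Each point is weighted by the inverse of its distance to the current iterate.
        weights = [1.0 / max(math.dist(x, pt), eps) for pt in points]
        total = sum(weights)
        x = [sum(w * pt[c] for w, pt in zip(weights, points)) / total
             for c in range(dim)]
    return x


def coordinatewise_median(points):
    """Median taken independently in every coordinate (the l1 geometric median)."""
    dim = len(points[0])
    return [median(pt[c] for pt in points) for c in range(dim)]


if __name__ == "__main__":
    # Four nearby points and one far outlier.
    pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]]
    print(geometric_median(pts))       # stays inside the unit square
    print(coordinatewise_median(pts))  # [1.0, 1.0]
```

Both estimates remain near the four clustered points, whereas the plain mean of the same five points is (20.4, 20.4).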

Firstly, we use Corollary 2 to see that with probability , . Now, we assume that is true. We will show the remainder of the theorem holds with probability at least , as then a union bound will give us the desired result.

(1): Let us assume that the number of clusters is for some , and note that . Now, choose . Choose . Assume that the geometric median is more than distance away from the true mean. Then, by the previous lemma, at least fraction of the empirical means of the clusters must lie at least distance away from the true mean. Because we assume the number of clusters is more than , at least fraction of the empirical means of the uncorrupted clusters must also lie at least distance away from the true mean.

Recall that the variance of the mean of an “honest” vote group is given by

$$
(\sigma')^2 = \frac{\sigma^2 k}{b}.
$$

By applying Chebyshev’s inequality to the uncorrupted vote group , we find that its empirical mean satisfies

$$
\mathbb{P}\left(\|G[i] - G\| \ge 4\sigma\sqrt{\frac{k}{b}}\right) \le \frac{1}{16}.
$$

Now, we define a Bernoulli event that is 1 if the empirical mean of an uncorrupted vote group is at distance larger than to the true mean, and 0 otherwise. By the computation above, the probability of this event is less than . Thus, its mean is less than and we want to upper bound the probability that empirical mean is more than . Using the number of events as , we find that this holds with probability at least . For this, we used the following version of Hoeffding’s inequality in this part and part (3) of this proof. For Bernoulli events with mean , empirical mean , number of events and deviation :

$$
\mathbb{P}(\hat{\mu} - \mu \ge \theta) \le \exp(-2m\theta^2)
$$

To finish the proof, just plug in the values of given in the Lemma 2.1 (written above) from [20], where for Geometric Median.

(2): For coordinate-wise median, we set . Then we apply the result proved in previous part for each dimension of . Then, we get that with probability at least ,

$$
|\hat{G}_i - G_i| \le C_1 \sigma_i \sqrt{\frac{\log(d/\delta)}{b}}
$$

where is the coordinate of , is the coordinate of and is the diagonal entry of . Doing a union bound, we get that with probability at least

$$
\|\hat{G} - G\| \le C_1 \sigma \sqrt{\frac{\log(d/\delta)}{b}}.
$$

(3): Define

$$
\Delta_i = \sigma_i \sqrt{\frac{k/b}{\sqrt{\frac{1}{2k}\log\frac{d}{\delta}}}}
$$

where is the diagonal entry of . Now, for each uncorrupted vote group, using Chebyshev’s inequality:

$$
\mathbb{P}\left(|\hat{G}_i - G_i| \ge \Delta_i\right) \le \sqrt{\frac{1}{2k}\log\frac{d}{\delta}}.
$$

Now, coordinate of -trimmed mean lies away from if at least of the coordinates of vote group empirical means lie away from . Note that because of the assumption of the Proposition . Because of these can be corrupted, at least of the true empirical means have coordinates that lie away from . This means that a fraction of the true empirical means have coordinates that lie away from . Define a Bernoulli variable for a vote group as being 1 if the coordinate of the empirical mean of that vote group lies more than away from , and 0 otherwise.

The mean of therefore satisfies

$$
\mathbb{E}(X) < \sqrt{\frac{1}{2k}\log\frac{d}{\delta}}.
$$

Set

$$
\alpha = 4\sqrt{\frac{1}{2k}\log\frac{d}{\delta}}.
$$

Again, using Hoeffding’s inequality in a manner analogous to part (1) of the proof, we get that probability of coordinate of -trimmed mean being more than away from is less than .

Taking union bound over all coordinates, we find that the probability of -trimmed mean being more than

$$
\sigma \sqrt{\frac{k/b}{\sqrt{\frac{1}{2k}\log\frac{d}{\delta}}}} = \sigma\sqrt{\frac{4k}{b\alpha}}
$$

away from is less than . Hence we have proved that if

$$
\alpha = 4\sqrt{\frac{1}{2k}\log\frac{d}{\delta}}
$$

and , then with probability at least , . Now, set and . One can easily see that is satisfied and we get that with probability at least , for some constant ,

$$
\Delta \le C_3 \sigma \sqrt{\frac{\log(d/\delta)}{b}}.
$$
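Part (3) reasons about the coordinate-wise trimmed mean without restating its definition. The sketch below assumes the usual one: for each coordinate, discard the $\lfloor \alpha m \rfloor$ smallest and the $\lfloor \alpha m \rfloor$ largest of the $m$ values and average the rest. The function name and the example data are hypothetical.

```python
def trimmed_mean(vectors, alpha):
    """Coordinate-wise alpha-trimmed mean of a list of equal-length vectors.

    Assumed definition: per coordinate, drop the floor(alpha * m) smallest and
    the floor(alpha * m) largest of the m values, then average what remains.
    """
    m = len(vectors)
    cut = int(alpha * m)
    if 2 * cut >= m:
        raise ValueError("alpha is too large: nothing would be left to average")
    result = []
    for c in range(len(vectors[0])):
        column = sorted(v[c] for v in vectors)
        kept = column[cut:m - cut]
        result.append(sum(kept) / len(kept))
    return result


if __name__ == "__main__":
    # Five hypothetical cluster means; the last one is corrupted in both coordinates.
    means = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [1000.0, -1000.0]]
    print(trimmed_mean(means, alpha=0.25))  # [3.0, 20.0]
```

A corrupted value is always trimmed if it is extreme, but trimming also removes honest values, which is why the argument above counts how many true empirical means must deviate.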

## Appendix B Extra Experimental Details

### B.1 Implementation and system-level optimization details

We introduce the details of combining Bulyan, Multi-krum, and coordinate-wise median with Detox.

• Bulyan: according to [19], Bulyan requires $p \ge 4q + 3$. In Detox, after the first majority voting level, the corresponding requirement in Bulyan becomes . Thus, we assign all “winning” gradients in to one cluster, i.e., Bulyan is conducted across 15 gradients.

• Multi-krum: according to [5], Multi-krum requires $p \ge 2q + 3$. Therefore, for a similar reason, we assign the 15 “winning” gradients to two groups of uneven sizes, 7 and 8 respectively.

• Coordinate-wise median: for this baseline we follow the theoretical analysis in Section 3.1, i.e., the 15 “winning” gradients are evenly assigned to 5 clusters of size 3 for the reverse gradient Byzantine attack. For the ALIE attack, we assign those 15 gradients evenly to 3 clusters of size 5. We chose these configurations simply because we observed that they perform better in our experiments. The mean of the gradients is then calculated in each cluster. Finally, we take the coordinate-wise median across the means of all clusters.
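The coordinate-wise median configuration in the last item can be sketched as follows. This is a minimal plain-Python illustration, not the PyTorch code used in the experiments. The function names, the toy two-dimensional gradients and the positions of the corrupted entries are all assumptions.

```python
from statistics import median


def cluster_means(gradients, num_clusters):
    """Split the gradients into equal consecutive clusters and average each one."""
    size, remainder = divmod(len(gradients), num_clusters)
    if remainder != 0:
        raise ValueError("gradients must split evenly into clusters")
    means = []
    for start in range(0, len(gradients), size):
        cluster = gradients[start:start + size]
        dim = len(cluster[0])
        means.append([sum(g[c] for g in cluster) / size for c in range(dim)])
    return means


def hierarchical_median(gradients, num_clusters):
    """Coordinate-wise median across the per-cluster mean gradients."""
    means = cluster_means(gradients, num_clusters)
    dim = len(means[0])
    return [median(m[c] for m in means) for c in range(dim)]


# 15 hypothetical "winning" gradients, two of which survived voting while corrupted.
winners = [[1.0, -1.0] for _ in range(15)]
winners[0] = [100.0, 100.0]
winners[7] = [100.0, 100.0]

# 5 clusters of size 3: only two cluster means are affected, so the median is clean.
aggregate = hierarchical_median(winners, num_clusters=5)
print(aggregate)  # [1.0, -1.0]
```

With the same toy data and 3 clusters of size 5, two of the three cluster means would be corrupted and the median would no longer recover the honest gradient, which illustrates why the cluster count matters.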

One important point is that we conducted system-level optimizations when implementing Multi-krum and Bulyan, e.g., parallelizing the computationally heavy parts, in order to make the comparisons fairer, following [11]. The main idea of our system-level optimization is two-fold: i) the gradients of all layers of a neural network are first vectorized and concatenated into a single high-dimensional vector. Robust aggregations are then applied to those high-dimensional gradient vectors from all compute nodes. ii) Several methods contain computationally heavy parts, e.g., calculating medians in the second stage of Bulyan. To optimize such parts, we split the high-dimensional gradient vectors evenly into chunks and parallelize the median calculations across the chunks. Our system-level optimization leads to a 2-4× speedup in the robust aggregation stage.
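The two steps above can be sketched structurally in plain Python. This is a hedged illustration only: the actual implementation operates on PyTorch tensors, and the function names, the chunk count and the use of a thread pool are assumptions. In pure Python the threads mainly show the shape of the computation; a real speedup needs native array operations or separate processes.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def flatten_layers(layer_grads):
    """Step i): concatenate per-layer gradients (each already a flat list) into one vector."""
    return [value for layer in layer_grads for value in layer]


def chunked_coordinate_median(vectors, num_chunks, max_workers=4):
    """Step ii): split the coordinates into chunks and take medians chunk by chunk."""
    dim = len(vectors[0])
    chunk = -(-dim // num_chunks)  # ceiling division
    ranges = [(start, min(start + chunk, dim)) for start in range(0, dim, chunk)]

    def median_of_range(bounds):
        lo, hi = bounds
        return [median(v[c] for v in vectors) for c in range(lo, hi)]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the chunks.
        pieces = list(pool.map(median_of_range, ranges))
    return [value for piece in pieces for value in piece]


# Three hypothetical compute nodes, each holding gradients for two layers.
node_grads = [
    [[1.0, 2.0], [3.0]],
    [[2.0, 9.0], [0.0]],
    [[3.0, 4.0], [5.0]],
]
flat = [flatten_layers(g) for g in node_grads]
result = chunked_coordinate_median(flat, num_chunks=2)
print(result)  # [2.0, 4.0, 3.0]
```

Because the coordinate-wise median treats every coordinate independently, chunking changes only how the work is scheduled and not the result.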

### B.3 Data augmentation and normalization details

In preprocessing the images in the CIFAR-10/100 datasets, we follow the standard data augmentation and normalization process. For data augmentation, random cropping and random horizontal flipping are used. Each color channel is normalized with mean and standard deviation by , . Each channel pixel is normalized by subtracting the mean value of its color channel and then dividing by the standard deviation of that color channel.
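The preprocessing steps described above can be sketched in plain Python on images stored as nested lists indexed `image[channel][row][column]`. The function names, the zero padding before cropping and the `padding` and `prob` parameters are assumptions. The per-channel mean and standard deviation are passed in as arguments because their values are not reproduced here.

```python
import random


def normalize(image, mean, std):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return [[[(px - mean[c]) / std[c] for px in row] for row in channel]
            for c, channel in enumerate(image)]


def random_horizontal_flip(image, rng, prob=0.5):
    """Mirror every row of every channel with probability `prob`."""
    if rng.random() < prob:
        return [[row[::-1] for row in channel] for channel in image]
    return image


def random_crop(image, size, padding, rng):
    """Zero-pad each channel by `padding` pixels, then cut a random size x size window."""
    height, width = len(image[0]), len(image[0][0])
    padded_width = width + 2 * padding
    top = rng.randint(0, height + 2 * padding - size)
    left = rng.randint(0, padded_width - size)
    cropped = []
    for channel in image:
        blank = [0.0] * padded_width
        padded = ([blank] * padding
                  + [[0.0] * padding + list(row) + [0.0] * padding for row in channel]
                  + [blank] * padding)
        cropped.append([row[left:left + size] for row in padded[top:top + size]])
    return cropped


if __name__ == "__main__":
    rng = random.Random(0)
    tiny = [[[1.0, 2.0], [3.0, 4.0]]]  # one channel, 2 x 2 pixels
    augmented = random_horizontal_flip(random_crop(tiny, 2, 1, rng), rng)
    print(normalize(augmented, mean=[1.0], std=[2.0]))  # placeholder statistics
```

Augmentation is applied before normalization in this sketch, so the zero padding corresponds to black pixels in the original value range.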

### B.4 Comparison between Detox and Draco

We provide experimental results comparing Detox with Draco.
