Phocas: dimensional Byzantine-resilient stochastic gradient descent


Cong Xie
Department of Computer Science
University of Illinois at Urbana Champaign
cx2@illinois.edu
&Oluwasanmi Koyejo
Department of Computer Science
University of Illinois at Urbana Champaign
sanmi@illinois.edu
&Indranil Gupta
Department of Computer Science
University of Illinois at Urbana Champaign
indy@illinois.edu
Abstract

We propose a novel robust aggregation rule for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the workers in the parameter server (PS) architecture. We prove the Byzantine resilience of the proposed aggregation rules. Empirical analysis shows that the proposed techniques outperform current approaches for realistic use cases and Byzantine attack scenarios.

1 Introduction

The failure resilience of distributed machine-learning systems has attracted increasing attention (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Alistarh et al., 2018) in the community. Larger clusters can accelerate training. However, this makes the distributed system more vulnerable to different kinds of failures or even attacks (Harinath et al., 2017). Thus, failure/attack resilience is becoming more and more important for distributed machine-learning systems, especially for large-scale deep learning (Dean et al., 2012; McMahan et al., 2017).

In this paper, we consider the most general failure model, Byzantine failures (Lamport et al., 1982), where the attackers may know all the information of the other processes and can attack any value in transmission. To be more specific, the data transmitted between the machines can be replaced by arbitrary values. Under such a model, there are no constraints on the failures or attackers.

The distributed training framework studied in this paper is the Parameter Server (PS). The PS architecture is composed of server nodes and worker nodes. The server nodes maintain a global copy of the model, aggregate the gradients from the workers, apply the gradients to the model, and broadcast the latest model to the workers. The worker nodes pull the latest model from the server nodes, compute the gradients according to the local portion of the training data, and send the gradients to the server nodes. The entire dataset and the corresponding workload are distributed to multiple worker nodes, thus parallelizing the computation via partitioning the dataset.

In this paper, we study the Byzantine resilience of synchronous Stochastic Gradient Descent (SGD), a popular class of learning algorithms using the PS architecture. Its variants are widely used in training deep neural networks (Kingma and Ba, 2014; Mukkamala and Hein, 2017). Such algorithms always wait to collect the gradients from all the worker nodes before moving on to the next iteration.

(a) Classic Byzantine
(b) Generalized Byzantine
Figure 1: The two figures visualize the $m$ workers with $d$-dimensional gradients. The $i$-th row represents the gradient vector produced by the $i$-th worker. The $j$-th column represents the $j$-th dimension of the gradients. A shaded block represents a value replaced by a Byzantine value. In the two examples, the maximal number of Byzantine values in each dimension is $q$. For the classic Byzantine model, all the Byzantine values must lie in the same workers (rows), while for the generalized Byzantine model there is no such constraint. Thus, (a) is a special case of (b).

The failure model can be described by an $m \times d$ matrix consisting of the $d$-dimensional gradients produced by the $m$ workers, as visualized in Figure 1. Previous work (Blanchard et al., 2017) has so far addressed a special case, where the Byzantine values must lie in the same rows (workers), as shown in Figure 1(a). Our failure model generalizes the classic Byzantine failure model by placing the Byzantine values anywhere in the matrix, without any constraint. For example, Lee et al. (2017) describe the vulnerability of a wireless transmission technology to bit-flipping attacks. The servers could receive data via such vulnerable communication media, even if the messages are encrypted. As a result, an arbitrary fraction of the received values can be Byzantine.

There are two limitations in most of the existing Byzantine-resilient SGD algorithms (Blanchard et al., 2017; Chen et al., 2017). First, they only consider the classic Byzantine model shown in Figure 1(a). However, Byzantine failures can also happen in the communication media/interfaces on the server side, which yields the generalized Byzantine model shown in Figure 1(b). Second, these algorithms are based on the Euclidean norm, which suffers from the curse of dimensionality: as the dimension $d$ grows, it becomes harder to distinguish the Byzantine gradients from the correct ones.

In this paper, we study dimensional Byzantine-resilient algorithms, which tolerate the generalized Byzantine model under certain conditions and are not affected by the curse of dimensionality. We propose Byzantine-resilient trimmed-mean-based aggregation rules. We assume that in each dimension, the number of Byzantine values is less than the number of correct ones. Resilience to such a Byzantine model is called "dimensional Byzantine resilience". The main contributions of this paper are listed below:

  • We formulate the dimensional Byzantine resilience property (Definition 5), and prove that the proposed trimmed-mean-based approaches are dimensional Byzantine-resilient. As far as we know, this paper is the first to study generalized Byzantine failures and dimensional Byzantine resilience for synchronous SGD.

  • We show that the proposed aggregation rules have low computation cost. Their time complexities are nearly linear, of the same order as averaging, the default choice for non-Byzantine aggregation.

2 Model

We consider the following optimization problem:

$$\min_{x \in \mathbb{R}^d} F(x), \qquad F(x) = \mathbb{E}_{z \sim \mathcal{D}}\left[f(x; z)\right],$$

where $x \in \mathbb{R}^d$ is the model parameter and $z$ is sampled from some unknown distribution $\mathcal{D}$. We assume that there exists a minimizer of $F(x)$, which is denoted by $x^*$.

We solve this problem in a distributed manner with $m$ workers. In each iteration, each worker samples $n$ independent and identically distributed (i.i.d.) data points from the distribution $\mathcal{D}$, and computes the gradient of the local empirical loss, $v_i^t = \frac{1}{n}\sum_{j=1}^{n} \nabla f(x^t; z_{i,j}^t)$, where $z_{i,j}^t$ is the $j$-th sampled data point on the $i$-th worker in the $t$-th iteration. The servers collect and aggregate the gradients sent by the workers, and update the model as follows:

$$x^{t+1} = x^t - \gamma^t \, \mathrm{Aggr}\left(\{\tilde{v}_i^t : i \in [m]\}\right),$$

where $\gamma^t$ is the learning rate, $\mathrm{Aggr}(\cdot)$ is an aggregation rule (e.g., averaging), and $\{\tilde{v}_i^t : i \in [m]\}$ is the set of gradient estimators received by the servers in the $t$-th iteration. Under Byzantine failures/attacks, the set of correct gradients $\{v_i^t : i \in [m]\}$ is partially replaced by arbitrary values, which yields $\{\tilde{v}_i^t : i \in [m]\}$.
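To make the update concrete, here is a minimal NumPy sketch of one synchronous step with a pluggable aggregation rule; the names (`sgd_step`, `mean_aggr`, `grads`) are our own illustration, not part of the paper.

```python
import numpy as np

def sgd_step(x, grads, aggr, lr):
    """One synchronous PS update: `grads` is the (m, d) matrix of received
    gradient estimates (possibly containing Byzantine values); `aggr` maps
    it to a single d-dimensional vector along which the server descends."""
    return x - lr * aggr(grads)

def mean_aggr(grads):
    """Averaging, the default non-Byzantine aggregation rule."""
    return grads.mean(axis=0)
```

The robust rules defined in Section 4 plug in as drop-in replacements for `mean_aggr`.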

3 Byzantine resilience

In this section, we formally define the classic Byzantine resilience property and its generalized version: dimensional Byzantine resilience.

Suppose that in a specific iteration, the correct vectors $v_1, \ldots, v_m$ are i.i.d. samples drawn from a random variable $G = \nabla f(x; z)$, where $G$ is an unbiased estimator of the gradient based on the current parameter $x$. Thus, $\mathbb{E}[G] = \nabla F(x)$, and $\mathbb{E}[v_i] = \nabla F(x)$ for any $i \in [m]$. We simplify the notation by omitting the iteration index $t$.

We first introduce the classic Byzantine model, which is reformulated from the model proposed by Blanchard et al. (2017). With $q$ Byzantine workers, the vectors actually received by the server nodes are as follows:

Definition 1 (Classic Byzantine Model).
$$\tilde{v}_i = \begin{cases} \text{arbitrary value} \in \mathbb{R}^d, & \text{if the $i$-th worker is Byzantine}, \\ v_i, & \text{otherwise}. \end{cases} \tag{1}$$

Note that the indices of the Byzantine workers can change across iterations. Furthermore, the server nodes are not aware of which workers are Byzantine. The only information given is the number of Byzantine workers $q$, if necessary.

We then introduce the classic Byzantine resilience.

Definition 2.

(Classic $(\Delta, q)$-Byzantine Resilience). Assume that $0 \le q \le m$. Let $v_1, \ldots, v_m$ be any i.i.d. random vectors in $\mathbb{R}^d$, $v_i \sim G$, with $\mathbb{E}[G] = g$. Let $\tilde{v}_1, \ldots, \tilde{v}_m$ be the set of vectors, of which up to $q$ are replaced by arbitrary vectors in $\mathbb{R}^d$, while the others still equal the corresponding $v_i$. An aggregation rule $\mathrm{Aggr}(\cdot)$ is said to be classic $(\Delta, q)$-Byzantine resilient if
$$\mathbb{E}\left\|\mathrm{Aggr}\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right\|^2 \le \Delta,$$
where $\Delta$ is a constant dependent on $G$ and $q$.

The baseline algorithm Krum is defined as follows.

Definition 3.

Krum chooses the vector with the minimal local sum of squared distances:
$$\mathrm{Krum}\left(\{\tilde{v}_i : i \in [m]\}\right) = \tilde{v}_k, \qquad k = \operatorname*{argmin}_{i \in [m]} \sum_{j \in \mathcal{N}_i} \|\tilde{v}_i - \tilde{v}_j\|^2,$$
where $\mathcal{N}_i$ is the set of indices of the $m - q - 2$ nearest neighbours of $\tilde{v}_i$ in $\{\tilde{v}_j : j \in [m], j \neq i\}$, measured by Euclidean distance.
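For reference, a small NumPy sketch of Krum as defined above (our own rendering of Definition 3, assuming $m > q + 2$ so that the neighbourhood size $m - q - 2$ is positive):

```python
import numpy as np

def krum(grads, q):
    """Krum: return the row of the (m, d) matrix `grads` whose summed squared
    Euclidean distance to its m - q - 2 nearest neighbours is minimal."""
    m = grads.shape[0]
    # Pairwise squared distances between all received gradient vectors.
    d2 = ((grads[:, None, :] - grads[None, :, :]) ** 2).sum(axis=2)
    # Per row: drop the zero self-distance, keep the m - q - 2 nearest.
    scores = np.sort(d2, axis=1)[:, 1:m - q - 1].sum(axis=1)
    return grads[np.argmin(scores)]
```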

The Krum aggregation is classic $(\Delta_0, q)$-Byzantine resilient under certain assumptions. The proof is given by Proposition 1 of Blanchard et al. (2017).

Lemma 1 (Blanchard et al. (2017)).

Let $v_1, \ldots, v_m$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. $q$ of $\{\tilde{v}_i : i \in [m]\}$ are Byzantine. If $2q + 2 < m$, we have $\mathbb{E}\left\|\mathrm{Krum}\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right\|^2 \le \Delta_0$, where $\Delta_0$ is a constant dependent on $m$ and $q$, scaling with $d\sigma^2$.

The generalized Byzantine model is defined as follows:

Definition 4 (Generalized Byzantine Model).
$$(\tilde{v}_i)_j = \begin{cases} \text{arbitrary value} \in \mathbb{R}, & \text{if the value is Byzantine}, \\ (v_i)_j, & \text{otherwise}, \end{cases} \tag{2}$$

where $(\tilde{v}_i)_j$ is the $j$-th dimension of the vector $\tilde{v}_i$.

Based on the Byzantine model above, we introduce a generalized Byzantine resilience property, dimensional $(\Delta, q)$-Byzantine resilience, which is defined as follows:

Definition 5.

(Dimensional $(\Delta, q)$-Byzantine Resilience). Assume that $0 \le q \le m$. Let $v_1, \ldots, v_m$ be any i.i.d. random vectors in $\mathbb{R}^d$, $v_i \sim G$, with $\mathbb{E}[G] = g$. Let $\{\tilde{v}_i : i \in [m]\}$ be the set of candidate vectors. In each dimension, up to $q$ of the values are replaced by arbitrary values, i.e., for dimension $j$, $q$ of $\{(\tilde{v}_i)_j : i \in [m]\}$ are Byzantine, where $(\tilde{v}_i)_j$ is the $j$-th dimension of the vector $\tilde{v}_i$. An aggregation rule $\mathrm{Aggr}(\cdot)$ is said to be dimensional $(\Delta, q)$-Byzantine resilient if
$$\mathbb{E}\left\|\mathrm{Aggr}\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right\|^2 \le \Delta,$$
where $\Delta$ is a constant dependent on $G$ and $q$.

Note that classic $(\Delta, q)$-Byzantine resilience is a special case of dimensional $(\Delta, q)$-Byzantine resilience: for classic Byzantine resilience, defined in Definition 2, all the Byzantine values must lie in the same subset of $q$ workers, as shown in Figure 1(a).

In the following propositions, we show that Mean and Krum are not dimensional Byzantine resilient (the constant $\Delta$ is unbounded). The proofs are provided in the appendix.

Proposition 1.

The averaging aggregation rule is not dimensional Byzantine-resilient.

Proposition 2.

Any aggregation rule that outputs $\mathrm{Aggr}\left(\{\tilde{v}_i : i \in [m]\}\right) \in \{\tilde{v}_i : i \in [m]\}$ is not dimensional Byzantine resilient.

Krum chooses the vector with the minimal score, so its output is always one of the candidate vectors; by Proposition 2, it is therefore not dimensional Byzantine-resilient.

Proposition 3.

$\mathrm{Krum}\left(\{\tilde{v}_i : i \in [m]\}\right)$ is not dimensional Byzantine-resilient.

4 Trimmed-mean-based aggregation

With the Byzantine failure models defined in Equations (1) and (2), we propose two trimmed-mean-based aggregation rules, which are Byzantine resilient under certain conditions.

4.1 Trimmed mean

To define the trimmed mean, we first define the order statistics.

Definition 6.

(Order Statistics) By sorting the scalar sequence $\{u_i : i \in [m]\}$, we get $u_{(1)} \le u_{(2)} \le \cdots \le u_{(m)}$, where $u_{(k)}$ is the $k$-th smallest element in $\{u_i : i \in [m]\}$.

Then, we define the trimmed mean.

Definition 7.

(Trimmed Mean) For $b \in \{0, 1, \ldots, \lceil m/2 \rceil - 1\}$, the $b$-trimmed mean of the set of scalars $\{u_i : i \in [m]\}$ is defined as follows:

$$\mathrm{Trmean}_b\left(\{u_i : i \in [m]\}\right) = \frac{1}{m - 2b}\sum_{k=b+1}^{m-b} u_{(k)},$$

where $u_{(k)}$ is the $k$-th smallest element in $\{u_i : i \in [m]\}$ defined in Definition 6. The high-dimensional version, $\mathrm{Trmean}_b\left(\{v_i : i \in [m]\}\right)$, simply applies the trimmed mean in a coordinate-wise manner.
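A direct coordinate-wise implementation of Definition 7 as a short NumPy sketch (sorting is used for clarity; see Section 4.4 for the selection-based variant):

```python
import numpy as np

def trmean(grads, b):
    """b-trimmed mean of an (m, d) matrix: in each dimension, drop the b
    smallest and b largest values and average the remaining m - 2b."""
    m = grads.shape[0]
    s = np.sort(grads, axis=0)   # per-dimension order statistics
    return s[b:m - b].mean(axis=0)
```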

The following theorem shows that by using $\mathrm{Trmean}_b(\cdot)$, the resulting aggregation is dimensional Byzantine resilient. A proof is provided in the appendix.

Theorem 1.

(Bounded Variance) Let $v_1, \ldots, v_m$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. In each dimension, $q$ of the values are Byzantine, which yields the set $\{\tilde{v}_i : i \in [m]\}$. If $b \ge q$, we have $\mathbb{E}\left\|\mathrm{Trmean}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right\|^2 \le \Delta_1$, where $\Delta_1 = \frac{2(m-q)}{m-2b}\,d\sigma^2$.

Theorem 1 tells us that the upper bound on the variance decreases when the number of workers $m$ increases, or when $b$ or $\sigma$ decreases; since $b \ge q$, a smaller Byzantine number $q$ also allows a smaller $b$.

4.2 Beyond trimmed mean

Using the trimmed mean, we have to drop $2b$ elements in each dimension. In this section, we explore the possibility of aggregating more elements. To be more specific, for each dimension, we take the average of the $m - b$ values nearest to the trimmed mean. We call the resulting aggregation rule Phocas (the name of a Byzantine emperor), which is defined as follows:

Definition 8.

(Phocas) We sort the scalar sequence $\{u_i : i \in [m]\}$ by the distance to a certain value $c$: $|u_{(1)} - c| \le |u_{(2)} - c| \le \cdots \le |u_{(m)} - c|$, where $u_{(k)}$ is the $k$-th nearest element to $c$ in $\{u_i : i \in [m]\}$. Phocas is the average of the first $m - b$ elements nearest to the $b$-trimmed mean $c = \mathrm{Trmean}_b\left(\{u_i : i \in [m]\}\right)$:

$$\mathrm{Phocas}_b\left(\{u_i : i \in [m]\}\right) = \frac{1}{m-b}\sum_{k=1}^{m-b} u_{(k)}.$$

The high-dimensional version, $\mathrm{Phocas}_b\left(\{v_i : i \in [m]\}\right)$, simply applies the same rule in a coordinate-wise manner.
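Definition 8 translates into a few lines of NumPy; this sketch reuses the `trmean` helper above:

```python
import numpy as np

def phocas(grads, b):
    """Phocas: in each dimension, average the m - b values nearest to the
    b-trimmed mean of that dimension (requires the trmean sketch above)."""
    m = grads.shape[0]
    center = trmean(grads, b)     # b-trimmed mean, per dimension
    # Indices of the m - b values nearest to the center, per dimension.
    order = np.argsort(np.abs(grads - center), axis=0)[:m - b]
    return np.take_along_axis(grads, order, axis=0).mean(axis=0)
```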

We show that $\mathrm{Phocas}_b(\cdot)$ is dimensional Byzantine-resilient.

Theorem 2.

(Bounded Variance) Let $v_1, \ldots, v_m$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. In each dimension, $q$ of the values are Byzantine, which yields the set $\{\tilde{v}_i : i \in [m]\}$. If $b \ge q$, we have $\mathbb{E}\left\|\mathrm{Phocas}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right\|^2 \le \Delta_2$, where $\Delta_2$ is a constant dependent on $m$, $b$, and $q$, scaling with $d\sigma^2$; an explicit expression is derived in the appendix.

The Phocas aggregation can be viewed as a trimmed average centered at the trimmed mean, which filters out the values far away from the trimmed mean. Similar to the trimmed mean, the variance bound of Phocas decreases when $m$ increases, or when $b$ or $\sigma$ decreases.

4.3 Convergence analysis

In this section, we provide convergence guarantees for synchronous SGD with $(\Delta, q)$-Byzantine-resilient aggregation rules. The proofs can be found in the appendix. We first introduce the two conditions needed in our convergence analysis.

Definition 9.

If $F(x)$ is $L$-smooth, then for all $x, y \in \mathbb{R}^d$,
$$F(y) \le F(x) + \langle \nabla F(x), y - x \rangle + \frac{L}{2}\|y - x\|^2,$$
where $L > 0$. If $F(x)$ is $\mu$-strongly convex, then for all $x, y \in \mathbb{R}^d$,
$$F(y) \ge F(x) + \langle \nabla F(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2,$$
where $\mu > 0$.

First, we prove that for strongly convex and smooth loss functions, SGD with $(\Delta, q)$-Byzantine-resilient aggregation rules has linear convergence with a constant error.

Theorem 3.

Assume that $F(x)$ is $\mu$-strongly convex and $L$-smooth, where $0 < \mu \le L$. We take the learning rate $\gamma^t = 1/L$. In any iteration $t$, the correct gradients are $\{v_i^t : i \in [m]\}$. Using any (classic or dimensional) $(\Delta, q)$-Byzantine-resilient aggregation rule with the corresponding assumptions, we obtain linear convergence with a constant error after $T$ iterations of synchronous SGD:
$$\mathbb{E}\left[F(x^T)\right] - F(x^*) \le \left(1 - \frac{\mu}{L}\right)^T \left(F(x^0) - F(x^*)\right) + \frac{\Delta}{2\mu}.$$
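For intuition, the one-step argument behind Theorem 3 is the standard smooth/strongly-convex recursion; a sketch in our notation, writing $A^t = \mathrm{Aggr}(\{\tilde{v}_i^t\})$ and $g^t = \nabla F(x^t)$:

```latex
\begin{align*}
% L-smoothness with step size \gamma^t = 1/L, then expand the square:
F(x^{t+1}) &\le F(x^t) - \tfrac{1}{L}\langle g^t, A^t\rangle + \tfrac{1}{2L}\|A^t\|^2
            = F(x^t) + \tfrac{1}{2L}\left(\|A^t - g^t\|^2 - \|g^t\|^2\right). \\
% Resilience gives E||A^t - g^t||^2 <= \Delta; strong convexity gives
% ||g^t||^2 >= 2\mu (F(x^t) - F(x^*)). Taking expectations:
\mathbb{E}\,F(x^{t+1}) - F(x^*)
  &\le \left(1 - \tfrac{\mu}{L}\right)\left(\mathbb{E}\,F(x^t) - F(x^*)\right) + \tfrac{\Delta}{2L}.
\end{align*}
% Unrolling over T iterations and summing the geometric series gives
% Theorem 3, with constant error \Delta/(2\mu).
```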

Then, we prove the convergence of SGD for general smooth loss functions.

Theorem 4.

Assume that $F(x)$ is $L$-smooth and potentially non-convex, where $L > 0$. We take the learning rate $\gamma^t = 1/L$. In any iteration $t$, the correct gradients are $\{v_i^t : i \in [m]\}$. Using any (classic or dimensional) $(\Delta, q)$-Byzantine-resilient aggregation rule with the corresponding assumptions, the averaged squared gradient norm converges to a constant error after $T$ iterations of synchronous SGD:
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left\|\nabla F(x^t)\right\|^2 \le \frac{2L\left(F(x^0) - F(x^*)\right)}{T} + \Delta.$$

4.4 Time complexity

For the trimmed mean, we only need to find the order statistics of each dimension. To do so, we use the selection algorithm of Blum et al. (1973), which finds the $k$-th smallest element in linear time, so the overall time complexity is $O(dm)$, the same order as averaging. Alternatively, when $m$ is small, one can simply sort each dimension, which costs $O(dm \log m)$ and is still nearly linear. For Phocas, the computation additional to computing the trimmed mean takes linear time $O(dm)$. Thus, its time complexity is the same as that of Trmean. Note that for Krum and Multi-Krum, the time complexity is $O(dm^2)$ (Blanchard et al., 2017).
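In NumPy terms, the selection-based trimmed mean can be written with `np.partition`, which runs in $O(m)$ per dimension (an introselect, in the same spirit as the median-of-medians algorithm of Blum et al. (1973)); a sketch:

```python
import numpy as np

def trmean_fast(grads, b):
    """b-trimmed mean via selection instead of full sorting: np.partition
    moves the b smallest and b largest of each column into the tails in
    O(m) per dimension, so the rule runs in O(dm) overall."""
    m = grads.shape[0]
    p = np.partition(grads, (b, m - b - 1), axis=0)
    return p[b:m - b].mean(axis=0)   # the middle m - 2b values, unordered
```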

Dataset | # train | # test | Learning rate | # rounds | Batch size | Evaluation metric
MNIST (Loosli et al., 2007) | 60k | 10k | 0.1 | 500 | 32 | top-1 accuracy
CIFAR10 (Krizhevsky and Hinton, 2009) | 50k | 10k | 5e-4 | 4000 | 128 | top-3 accuracy
Table 1: Experiment summary.
(a) Gaussian attack. 6 out of 20 gradient vectors are replaced by i.i.d. random vectors drawn from a Gaussian distribution with zero mean and standard deviation 200.
(b) Omniscient attack. 6 out of 20 gradient vectors are replaced by the negative sum of all the correct gradients, scaled by a large constant (1e20).
(c) Bit-flip attack. For each of the first 1000 dimensions, 1 of the 20 floating-point numbers is manipulated by flipping the 22nd, 30th, 31st, and 32nd bits.
(d) Gambler attack. The parameters are evenly assigned to 20 servers. For one single server, any received value is multiplied by a large constant with probability 0.05%.
Figure 2: Top-1 accuracy of MLP on MNIST with different attacks.
(a) Accuracy of the Krum-based aggregation rules under the bit-flip attack, at the end of training, when $q$ varies.
(b) Maximal accuracy under the gambler attack throughout training, when $b$ ($q$ for Krum and Multi-Krum) varies.
Figure 3: Sensitivity to hyperparameters.

5 Experiments

In this section, we evaluate the Byzantine resilience of the proposed algorithms. We consider two image classification tasks: handwritten digit classification on the MNIST dataset using a multi-layer perceptron (MLP), and object recognition on the CIFAR10 dataset using a convolutional neural network (CNN). The details of these two neural networks can be found in the appendix. There are $m = 20$ worker processes. We repeat each experiment ten times and report the average. The details of the datasets and the default hyperparameters of the corresponding models are listed in Table 1. We use top-1 or top-3 accuracy on the testing sets (disjoint from the training sets) as evaluation metrics.

The baseline aggregation rules are Mean, Krum (Definition 3), and Multi-Krum. Multi-Krum is a variant of Krum defined in Blanchard et al. (2017), which takes the average over several vectors selected by multiple rounds of Krum. We also include averaging without Byzantine failures as a baseline, referred to as Mean without Byzantine. We compare these baselines with the proposed algorithms, Trmean (Definition 7) and Phocas (Definition 8), under different attacks.

Note that all the experiments of the CNN on CIFAR10 show results similar to those of the MLP on MNIST. Thus, we put the CNN results in the appendix.

5.1 Byzantine resilience

In this section, we test the Byzantine resilience of the proposed algorithms under different kinds of attacks. Zoomed-in versions of the figures for each experiment can be found in the appendix.

5.1.1 Gaussian attack

We test classic Byzantine resilience in this experiment. We consider attackers that replace some of the gradient vectors with Gaussian random vectors with zero mean and isotropic covariance matrix with standard deviation 200. We refer to this kind of attack as the Gaussian attack. 6 out of the 20 gradient vectors are Byzantine. The results are shown in Figure 2(a). As expected, averaging is not Byzantine resilient. The gaps between all the other algorithms are tiny. Phocas performs as if there were no Byzantine failures at all. Krum, Multi-Krum, and Trmean converge slightly slower.

5.1.2 Omniscient attack

We test classic Byzantine resilience in this experiment. This kind of attacker is assumed to know all the correct gradients. For each Byzantine gradient vector, the gradient is replaced by the negative sum of all the correct gradients, scaled by a large constant (1e20 in the experiments). Roughly speaking, this attack tries to make the parameter server move in the opposite direction with a large step. 6 out of the 20 gradient vectors are Byzantine. The results are shown in Figure 2(b). Phocas still performs as if there were no failures. Multi-Krum is not as good as Phocas, but the gap is small. Krum converges slower. However, Trmean converges to bad solutions.

5.1.3 Bit-flip attack

We test dimensional Byzantine resilience in this experiment. Knowing the information of the other workers can be difficult in practice. Thus, we use a more realistic scenario here: the attacker only manipulates some individual floating-point numbers by flipping their 22nd, 30th, 31st, and 32nd bits. For each of the first 1000 dimensions, 1 of the 20 floating-point numbers is manipulated using this bit-flip attack. The results are shown in Figure 2(c). As expected, only Phocas and Trmean are dimensional Byzantine resilient.
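The attack itself is easy to reproduce on float32 values; the sketch below assumes the bit positions are counted 1-indexed from the least-significant bit, which the text does not spell out:

```python
import numpy as np

# Flip the 22nd, 30th, 31st and 32nd bits (1-indexed from the LSB -- our
# assumption about the intended bit order) of each float32 value.
MASK = np.uint32((1 << 21) | (1 << 29) | (1 << 30) | (1 << 31))

def bit_flip(x):
    """XOR the mask into the binary representation of each float32 value."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    return (x.view(np.uint32) ^ MASK).view(np.float32)
```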

Note that for Krum and Multi-Krum, their assumption requires the number of Byzantine vectors $q$ to satisfy $2q + 2 < m$, which means $q \le 8$ with $m = 20$ workers in our experiments. However, because each gradient is partially manipulated, all the vectors are Byzantine, which breaks the assumption of the Krum-based algorithms. Furthermore, to compute the distances to the $m - q - 2$ nearest neighbours, $m - q - 2$ must be positive. To test the performance of Krum and Multi-Krum, we set $q = 8$ for these two algorithms so that they can still be executed. Furthermore, we test whether tuning $q$ can make a difference. The results are shown in Figure 3(a). Obviously, whatever $q$ we use, the Krum-based algorithms get stuck around bad solutions.

5.1.4 General attack with multiple servers

We test general Byzantine resilience in this experiment. We evaluate the robust aggregation rules under a more general and realistic type of attack. It is very popular to partition the parameters into disjoint subsets and use multiple server nodes to store and aggregate them (Li et al., 2014a, b; Ho et al., 2013). We assume that the parameters are evenly partitioned and assigned to the server nodes. The attacker picks one single server and manipulates any received value by multiplying it by a large constant, with probability 0.05%. We call this attack the gambler attack, because the attacker randomly manipulates the values, with the goal that in some iterations the assumptions/prerequisites of the robust aggregation rules are broken, which crashes the training. Such an attack requires less global information, and can be concentrated on one single server, which makes it more realistic and easier to implement.
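A sketch of the gambler attack on the parameter block held by the attacked server; the multiplier `scale` stands in for the large constant, which the text above leaves unspecified:

```python
import numpy as np

def gambler(block, scale=1e20, p=0.0005, rng=None):
    """Multiply each value received by the attacked server by `scale`
    independently with probability 0.05% (p = 0.0005)."""
    rng = np.random.default_rng() if rng is None else rng
    hit = rng.random(block.shape) < p
    return np.where(hit, block * scale, block)
```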

In Figure 2(d), we evaluate the performance of all the robust aggregation rules under the gambler attack. The number of servers is 20. For Krum, Multi-Krum, and Phocas, we set the estimated Byzantine number to 8 ($q$ for the Krum-based rules, $b$ for the trimmed-mean-based rules). Only Phocas and Trmean survive under this attack. Their convergence is slightly slower than averaging without Byzantine values, but the gaps are small.

5.1.5 Sensitivity to the hyperparameters

We test the robustness to the estimated number of Byzantine workers $b$ ($q$ for Krum and Multi-Krum) in this experiment. We report the maximal accuracy throughout training. The results are shown in Figure 3(b). The performance of Phocas and Trmean does not change significantly when $b$ varies.

5.2 Discussion

As expected, Mean aggregation is not Byzantine resilient. Krum and Multi-Krum are classic Byzantine-resilient but not dimensional Byzantine-resilient. Phocas and Trmean are dimensional Byzantine-resilient. However, under the omniscient attack, Trmean suffers from larger variance, which slows down its convergence.

The gambler attack shows the true advantage of dimensional Byzantine resilience: a higher probability of survival. Under such an attack, chances are that the assumptions/prerequisites of Phocas and Trmean may still fail. However, their probability of crashing is lower than that of the other algorithms, because dimensional Byzantine resilience generalizes classic Byzantine resilience. An interesting observation is that Trmean is slightly better than Phocas under the gambler attack. That is because the estimated Byzantine number is not accurate, which causes some unpredictable behavior for Phocas. We choose 8 because it is the maximal value we can take for Krum and Multi-Krum.

Phocas performs best in almost all cases. Multi-Krum is also good, except that it is not dimensional Byzantine-resilient. The reason why Phocas and Multi-Krum perform better is that they aggregate more candidates, which stabilizes the convergence. Note that Phocas not only performs just as well as or even better than Multi-Krum, but also has lower time complexity.

Trmean has the cheapest computation. Its worst case, the omniscient attack, is hard to implement in reality. Thus, for most applications, we suggest Trmean as an easy-to-implement aggregation rule with robust performance. However, if we must assume the worst-case attacks/failures, Phocas should be adopted for the best robustness.

6 Related work

Our work is closely related to Blanchard et al. (2017) and Yin et al. (2018). Another paper, Chen et al. (2017), proposes a grouped geometric median for Byzantine resilience, for strongly convex loss functions.

Our approach offers the following important advantages over the previous work.

  • Cheaper computation compared to Krum. Trmean and Phocas have nearly linear time complexity $O(dm)$, while the time complexity of Krum is $O(dm^2)$.

  • Dimensional Byzantine resilience. Trmean and Phocas tolerate a more general type of Byzantine failures described in Equation (2) and Definition 5, while Krum can only tolerate the classic Byzantine failures described in Equation (1) and Definition 2.

  • Simpler dimension-free convergence guarantees with fewer assumptions. Yin et al. (2018) also study the Byzantine resilience of the trimmed mean and its special case, the median. However, in that work, the bounds grow with the number of dimensions $d$, even if the overall variance of the gradients is fixed. To establish the bounds, the authors assume a bounded domain and sub-exponential gradients with bounded skewness, which are not required in our theoretical analysis. In this paper, we use fewer assumptions to prove dimension-free theoretical guarantees for the trimmed mean.

The major contribution of this paper is a combination of theory and practice. First, we provide the theoretical guarantee of the convergence of the trimmed mean with fewer assumptions. Then, we propose a novel aggregation rule, Phocas, which has comparable theoretical guarantees, and comparable or even better performance in the experiments.

7 Conclusion

We investigate generalized Byzantine resilience, and propose trimmed-mean-based aggregation rules for synchronous SGD. The algorithms have low time complexity and provable convergence. Our empirical results show good performance. In future work, we will study Byzantine resilience in other scenarios, such as asynchronous training.

References

  • Alistarh et al. (2018) D. Alistarh, Z. Allen-Zhu, and J. Li. Byzantine stochastic gradient descent. arXiv preprint arXiv:1803.08917, 2018.
  • Blanchard et al. (2017) P. Blanchard, R. Guerraoui, J. Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 118–128, 2017.
  • Blum et al. (1973) M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of computer and system sciences, 7(4):448–461, 1973.
  • Bubeck et al. (2015) S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
  • Chen et al. (2017) Y. Chen, L. Su, and J. Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491, 2017.
  • Dean et al. (2012) J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
  • Harinath et al. (2017) D. Harinath, P. Satyanarayana, and M. R. Murthy. A review on security issues and attacks in distributed systems. Journal of Advances in Information Technology, 8(1), 2017.
  • Ho et al. (2013) Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. R. Ganger, and E. P. Xing. More effective distributed ml via a stale synchronous parallel parameter server. Advances in neural information processing systems, 2013:1223–1231, 2013.
  • Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
  • Lamport et al. (1982) L. Lamport, R. E. Shostak, and M. C. Pease. The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4:382–401, 1982.
  • Lee et al. (2017) J. Lee, D. Hwang, J. Park, and K.-H. Kim. Risk analysis and countermeasure for bit-flipping attack in lorawan. In Information Networking (ICOIN), 2017 International Conference on, pages 549–551. IEEE, 2017.
  • Li et al. (2014a) M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014a.
  • Li et al. (2014b) M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In NIPS, 2014b.
  • Loosli et al. (2007) G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. Large scale kernel machines, pages 301–320, 2007.
  • McMahan et al. (2017) H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
  • Mukkamala and Hein (2017) M. C. Mukkamala and M. Hein. Variants of rmsprop and adagrad with logarithmic regret bounds. In ICML, 2017.
  • Yin et al. (2018) D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.

8 Appendix

In the appendix, we introduce several useful lemmas and use them to derive the detailed proofs of the theorems in this paper.

8.1 Dimensional Byzantine resilience

Lemma 1 (Blanchard et al. [2017]).

Let $v_1, \ldots, v_m$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. $q$ of $\{\tilde{v}_i : i \in [m]\}$ are Byzantine. If $2q + 2 < m$, we have $\mathbb{E}\left\|\mathrm{Krum}\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right\|^2 \le \Delta_0$, where $\Delta_0$ is a constant dependent on $m$ and $q$, scaling with $d\sigma^2$.

Proof.

We denote the correct values among $\{\tilde{v}_i : i \in [m]\}$ by $\{v_i\}$, with $\mathbb{E}[v_i] = g$. Using Proposition 1 of Blanchard et al. [2017], which compares the chosen vector with the averages over $\mathcal{N}_k$, where $\mathcal{N}_k$ is the set of correct elements in the $m - q - 2$ nearest neighbours to $\tilde{v}_k$ in $\{\tilde{v}_i : i \in [m]\}$ measured by Euclidean distance, we obtain $\mathbb{E}\left\|\mathrm{Krum}\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right\|^2 \le \Delta_0$. ∎

Proposition 1.

Averaging is not dimensional Byzantine resilient.

Proof.

We demonstrate a counterexample. Consider the case where

$$\tilde{v}_1 = -\sum_{i=2}^{m} v_i - m g, \qquad \tilde{v}_i = v_i, \quad \forall i \in \{2, \ldots, m\}, \tag{3}$$

where $g = \mathbb{E}[v_i]$. Thus, the resulting aggregation is $\mathrm{Mean}\left(\{\tilde{v}_i : i \in [m]\}\right) = -g$. The inner product $\left\langle \mathbb{E}\left[\mathrm{Mean}\left(\{\tilde{v}_i : i \in [m]\}\right)\right], g \right\rangle = -\|g\|^2$ is always negative under this Byzantine attack. Thus, the expected update of SGD is not a descent direction, which means it will not converge to critical points. Note that in this counterexample, the number of Byzantine values in each dimension is $1$.

Hence, averaging is not dimensional Byzantine-resilient. ∎

Proposition 2.

Any aggregation rule that outputs $\mathrm{Aggr}\left(\{\tilde{v}_i : i \in [m]\}\right) \in \{\tilde{v}_i : i \in [m]\}$ is not dimensional Byzantine resilient.

Proof.

We demonstrate a counterexample. Consider the case where the $i$-th dimension of the $i$-th vector is manipulated by the malicious workers (e.g., multiplied by an arbitrarily large negative value), for $i \in [m]$ (assume $d \ge m$ for simplicity). Thus, up to $1$ value in each dimension is Byzantine. However, every candidate vector now contains a manipulated dimension, so no matter which vector is chosen, as long as the aggregation is chosen from $\{\tilde{v}_i : i \in [m]\}$, the inner product $\left\langle \mathbb{E}\left[\mathrm{Aggr}\left(\{\tilde{v}_i : i \in [m]\}\right)\right], g \right\rangle$ can be an arbitrarily large negative value under the Byzantine attack. Thus, the expected update of SGD is not a descent direction, which means it will not converge to critical points.

Hence, any aggregation rule that outputs $\mathrm{Aggr}\left(\{\tilde{v}_i : i \in [m]\}\right) \in \{\tilde{v}_i : i \in [m]\}$ is not dimensional Byzantine-resilient. ∎

8.2 Trimmed mean

We use the following lemma to bound the one-dimensional trimmed mean.

Lemma 2.

Assume that among the scalar sequence $\{\tilde{u}_i : i \in [m]\}$, $q$ elements are Byzantine. Without loss of generality, we denote the remaining correct values by $\{u_i : i \in [m-q]\}$. Thus, for $k \in \{q+1, \ldots, m\}$, $\tilde{u}_{(k)} \ge u_{(k-q)}$; for $k \in \{1, \ldots, m-q\}$, $\tilde{u}_{(k)} \le u_{(k)}$, where $\tilde{u}_{(k)}$ is the $k$-th smallest element in $\{\tilde{u}_i : i \in [m]\}$, and $u_{(k)}$ is the $k$-th smallest element in $\{u_i : i \in [m-q]\}$.

Proof.

We prove the two inequalities separately.
(i) We prove the first inequality by contradiction.
If $\tilde{u}_{(k)} < u_{(k-q)}$, then the $m - k + 1$ correct values $u_{(k-q)}, \ldots, u_{(m-q)}$ are all larger than $\tilde{u}_{(k)}$. However, because $\tilde{u}_{(k)}$ is the $k$-th smallest element in the sequence $\{\tilde{u}_i : i \in [m]\}$, there are at most $m - k$ elements larger than $\tilde{u}_{(k)}$, which yields a contradiction.

(ii) We prove the second inequality by contradiction.
If $\tilde{u}_{(k)} > u_{(k)}$, then the $k$ correct values $u_{(1)}, \ldots, u_{(k)}$ are all smaller than $\tilde{u}_{(k)}$. However, because $\tilde{u}_{(k)}$ is the $k$-th smallest element in the sequence $\{\tilde{u}_i : i \in [m]\}$, there are at most $k - 1$ elements smaller than $\tilde{u}_{(k)}$, which yields a contradiction. ∎

Theorem 1.

(Bounded Variance) Let $v_1, \ldots, v_m$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. In each dimension, $q$ of the values are Byzantine, which yields the set $\{\tilde{v}_i : i \in [m]\}$. If $b \ge q$, we have $\mathbb{E}\left\|\mathrm{Trmean}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right\|^2 \le \Delta_1$, where $\Delta_1 = \frac{2(m-q)}{m-2b}\,d\sigma^2$.

Proof.

We first assume that all the $v_i$'s, $\tilde{v}_i$'s, and $g$ are scalars, with variance $\mathbb{E}(v_i - g)^2 = \sigma^2$. Denote the correct values among $\{\tilde{v}_i : i \in [m]\}$ by $\{u_i : i \in [m-q]\}$. Using Lemma 2 (which applies because $b \ge q$), for $b + 1 \le k \le m - b$ we have $u_{(k-q)} \le \tilde{v}_{(k)} \le u_{(k)}$.

Thus, we have
$$\frac{1}{m-2b}\sum_{k=b+1}^{m-b} u_{(k-q)} \;\le\; \mathrm{Trmean}_b\left(\{\tilde{v}_i : i \in [m]\}\right) \;\le\; \frac{1}{m-2b}\sum_{k=b+1}^{m-b} u_{(k)}.$$

Note that for an arbitrary subset $\mathcal{Q} \subseteq [m-q]$ with $|\mathcal{Q}| = m - 2b$, we have the following bound:
$$\left(\frac{1}{m-2b}\sum_{k \in \mathcal{Q}} \left(u_{(k)} - g\right)\right)^2 \le \frac{1}{m-2b}\sum_{k \in \mathcal{Q}} \left(u_{(k)} - g\right)^2 \le \frac{1}{m-2b}\sum_{i=1}^{m-q} (u_i - g)^2.$$

By taking the expectations, we obtain
$$\mathbb{E}\left(\frac{1}{m-2b}\sum_{k \in \mathcal{Q}} \left(u_{(k)} - g\right)\right)^2 \le \frac{(m-q)\sigma^2}{m-2b}.$$

Both bounding averages above are of this form, and the squared deviation of a value sandwiched between them is at most the sum of their squared deviations. Combining all the ingredients above, we obtain the desired result:
$$\mathbb{E}\left(\mathrm{Trmean}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right)^2 \le \frac{2(m-q)\sigma^2}{m-2b}.$$

Then, we generalize the $v_i$'s, $\tilde{v}_i$'s, and $g$ to $d$-dimensional vectors with the variance $\mathbb{E}\|v_i - g\|^2 = d\sigma^2$, where the $j$-th dimension has variance $\sigma_j^2$ and $\sum_{j=1}^{d} \sigma_j^2 = d\sigma^2$. For each dimension $j \in [d]$, we have
$$\mathbb{E}\left(\left[\mathrm{Trmean}_b\left(\{\tilde{v}_i : i \in [m]\}\right)\right]_j - g_j\right)^2 \le \frac{2(m-q)\sigma_j^2}{m-2b}.$$

Thus, we have
$$\mathbb{E}\left\|\mathrm{Trmean}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right\|^2 \le \frac{2(m-q)}{m-2b}\,d\sigma^2 = \Delta_1. \;\blacksquare$$

8.3 Phocas

The following lemma bounds the one-dimensional $\mathrm{Phocas}_b(\cdot)$.

Lemma 3.

For the scalar sequence $\{\tilde{u}_i : i \in [m]\}$ with $q \le b$ Byzantine elements, and the corresponding trimmed mean $c = \mathrm{Trmean}_b\left(\{\tilde{u}_i : i \in [m]\}\right)$, we have

$$\left|\tilde{u}_{(k)} - c\right| \le \max_{i \in [m-q]} |u_i - c|, \quad \forall k \in [m-b],$$

where $\tilde{u}_{(k)}$ is the $k$-th nearest element to $c$ in $\{\tilde{u}_i : i \in [m]\}$, and $\{u_i : i \in [m-q]\}$ is the sequence of correct values in $\{\tilde{u}_i : i \in [m]\}$.

Proof.

We prove this lemma by contradiction.

Assume that $\left|\tilde{u}_{(m-b)} - c\right| > \max_{i \in [m-q]} |u_i - c|$. Thus, all $m - q$ correct values are closer than $\tilde{u}_{(m-b)}$ to $c$. However, according to the definition of $\tilde{u}_{(m-b)}$, it is the $(m-b)$-th closest value to $c$, which means that there are at most $m - b - 1 < m - q$ values closer than $\tilde{u}_{(m-b)}$ to $c$, which yields a contradiction. ∎

Theorem 2.

(Bounded Variance) Let $v_1, \ldots, v_m$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. In each dimension, $q$ of the values are Byzantine, which yields the set $\{\tilde{v}_i : i \in [m]\}$. If $b \ge q$, we have $\mathbb{E}\left\|\mathrm{Phocas}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right\|^2 \le \Delta_2$, where
$$\Delta_2 = \left[4(m-q) + \frac{2(m-q)\left(4(m-q)+2\right)}{m-2b}\right] d\sigma^2.$$

Proof.

We first assume that all the $v_i$'s, $\tilde{v}_i$'s, and $g$ are scalars, with variance $\mathbb{E}(v_i - g)^2 = \sigma^2$, and denote the correct values by $\{u_i : i \in [m-q]\}$. For convenience, we denote $c = \mathrm{Trmean}_b\left(\{\tilde{v}_i : i \in [m]\}\right)$. Thus, we have

$$\mathbb{E}\left(\mathrm{Phocas}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - g\right)^2 \le 2\,\mathbb{E}\left(\mathrm{Phocas}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - c\right)^2 + 2\,\mathbb{E}(c - g)^2.$$

Using Theorem 1, we already have

$$\mathbb{E}(c - g)^2 \le \frac{2(m-q)\sigma^2}{m-2b}.$$

We only need to bound $\mathbb{E}\left(\mathrm{Phocas}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - c\right)^2$ as follows. Using Lemma 3,

$$\left|\mathrm{Phocas}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - c\right| \le \frac{1}{m-b}\sum_{k=1}^{m-b}\left|\tilde{v}_{(k)} - c\right| \le \max_{i \in [m-q]} |u_i - c|.$$

Taking expectation on both sides, we have

$$\mathbb{E}\left(\mathrm{Phocas}_b\left(\{\tilde{v}_i : i \in [m]\}\right) - c\right)^2 \le \sum_{i=1}^{m-q}\mathbb{E}(u_i - c)^2 \le 2(m-q)\left(\sigma^2 + \mathbb{E}(c - g)^2\right).$$

Combining the three bounds above gives the scalar bound $4(m-q)\sigma^2 + \left(4(m-q)+2\right)\frac{2(m-q)\sigma^2}{m-2b}$; summing over the $d$ dimensions as in the proof of Theorem 1 yields the constant $\Delta_2$ claimed in Theorem 2. ∎