Phocas: dimensional Byzantine-resilient stochastic gradient descent
Abstract
We propose a novel robust aggregation rule for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the workers in the parameter server (PS) architecture. We prove the Byzantine resilience of the proposed aggregation rules. Empirical analysis shows that the proposed techniques outperform current approaches for realistic use cases and Byzantine attack scenarios.
1 Introduction
The failure resilience of distributed machine-learning systems has attracted increasing attention in the community (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Alistarh et al., 2018). Larger clusters can accelerate training. However, a larger cluster also makes the distributed system more vulnerable to different kinds of failures or even attacks (Harinath et al., 2017). Thus, failure/attack resilience is becoming more and more important for distributed machine-learning systems, especially for large-scale deep learning (Dean et al., 2012; McMahan et al., 2017).
In this paper, we consider the most general failure model, Byzantine failures (Lamport et al., 1982), where the attackers may know any information about the other processes and may attack any value in transmission. To be more specific, the data transmitted between the machines can be replaced by arbitrary values. Under such a model, there are no constraints on the failures or attackers.
The distributed training framework studied in this paper is the Parameter Server (PS). The PS architecture is composed of the server nodes and the worker nodes. The server nodes maintain a global copy of the model, aggregate the gradients from the workers, apply the gradients to the model, and broadcast the latest model to the workers. The worker nodes pull the latest model from the server nodes, compute the gradients according to the local portion of the training data, and send the gradients to the server nodes. The entire dataset and the corresponding workload are distributed to multiple worker nodes, thus parallelizing the computation via partitioning the dataset.
In this paper, we study the Byzantine resilience of synchronous Stochastic Gradient Descent (SGD), which is a popular class of learning algorithms using the PS architecture. Its variants are widely used in training deep neural networks (Kingma and Ba, 2014; Mukkamala and Hein, 2017). Such algorithms always wait to collect gradients from all the worker nodes before moving on to the next iteration.
The failure model can be described by using an $m \times d$ matrix consisting of the $d$-dimensional gradients produced by $m$ workers, as visualized in Figure 1. Previous work (Blanchard et al., 2017) has so far addressed a special case, where the Byzantine values must lie in the same rows (workers), as shown in Figure 1(a). Our failure model generalizes the classic Byzantine failure model by placing the Byzantine values anywhere in the matrix without any constraint. For example, Lee et al. (2017) describe the vulnerability of a wireless transmission technology to bit-flipping attacks. The servers could receive data via such vulnerable communication media, even if the messages are encrypted. As a result, an arbitrary fraction of the received values are Byzantine.
Most of the existing Byzantine-resilient SGD algorithms (Blanchard et al., 2017; Chen et al., 2017) have two limitations. First, they only consider the classic Byzantine model shown in Figure 1(a). However, the Byzantine failures can also happen in the communication media/interfaces on the server side, which yields the generalized Byzantine model shown in Figure 1(b). Second, the algorithms are based on the Euclidean norm, which suffers from the curse of dimensionality. As the dimension grows, it becomes more difficult to distinguish the Byzantine gradients from the correct ones.
In this paper, we study dimensional Byzantine-resilient algorithms, which tolerate the generalized Byzantine model under certain conditions and are not affected by the curse of dimensionality. We propose Byzantine-resilient trimmed-mean-based aggregation rules. We assume that for each dimension, the number of Byzantine values must be less than the number of correct ones. The resilience to such a Byzantine model is called "dimensional Byzantine resilience". The main contributions of this paper are listed below:

We formulate the dimensional Byzantine resilience property, and prove that the proposed trimmed-mean-based approaches are dimensional Byzantine-resilient (Definition 5). As far as we know, this paper is the first to study generalized Byzantine failures and dimensional Byzantine resilience for synchronous SGD.

We show that the proposed aggregation rules have low computation cost. Their time complexities are nearly linear, which is of the same order as averaging, the default choice for non-Byzantine aggregation.
2 Model
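We consider the following optimization problem:
$$\min_{x \in \mathbb{R}^d} F(x),$$
where $F(x) = \mathbb{E}_{z \sim \mathcal{D}}[f(x; z)]$, and $z$ is sampled from some unknown distribution $\mathcal{D}$. We assume that there exists a minimizer of $F(x)$, which is denoted by $x^*$.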
We solve this problem in a distributed manner with $m$ workers. In each iteration, each worker samples $n$ independent and identically distributed (i.i.d.) data points from the distribution $\mathcal{D}$, and computes the gradient of the local empirical loss $F_i(x) = \frac{1}{n} \sum_{j=1}^{n} f(x; z^{i,j})$, where $z^{i,j}$ is the $j$th sampled data point on the $i$th worker. The servers collect and aggregate the gradients sent by the workers, and update the model as follows:
$$x^{t+1} = x^t - \gamma^t \, \mathrm{Aggr}(\{v_i^t : i \in [m]\}),$$
where $\mathrm{Aggr}(\cdot)$ is an aggregation rule (e.g., averaging), $\gamma^t$ is the learning rate, and $\{v_i^t : i \in [m]\}$ is the set of gradient estimators received by the servers in the $t$th iteration. Under Byzantine failures/attacks, $\{v_i^t : i \in [m]\}$ is partially replaced by arbitrary values, which yields $\{\tilde{v}_i^t : i \in [m]\}$.
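The update rule can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments; the names `sgd_step` and `mean_aggregate` are hypothetical, and gradients are represented as plain lists of floats.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Aggregator = Callable[[Sequence[Vector]], Vector]


def mean_aggregate(grads: Sequence[Vector]) -> Vector:
    """Coordinate-wise average of the received gradients (not robust)."""
    m = len(grads)
    return [sum(column) / m for column in zip(*grads)]


def sgd_step(x: Vector, grads: Sequence[Vector], lr: float,
             aggregate: Aggregator = mean_aggregate) -> Vector:
    """One synchronous update: x <- x - lr * Aggr(received gradients)."""
    direction = aggregate(grads)
    return [xi - lr * gi for xi, gi in zip(x, direction)]


# Two workers, two parameters; the numbers are illustrative only.
x_next = sgd_step([1.0, 2.0], [[1.0, 0.0], [3.0, 2.0]], lr=0.5)
```

Any robust rule with the same signature can be passed as `aggregate` in place of averaging.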
3 Byzantine resilience
In this section, we formally define the classic Byzantine resilience property and its generalized version: dimensional Byzantine resilience.
Suppose that in a specific iteration, the correct vectors $\{v_i : i \in [m]\}$ are i.i.d. samples drawn from the random variable $G$, where $G$ is an unbiased estimator of the gradient based on the current parameter $x$. Thus, $\mathbb{E}[v_i] = \mathbb{E}[G] = g = \nabla F(x)$, for any $i \in [m]$. We simplify the notation by ignoring the iteration index $t$.
We first introduce the classic Byzantine model, which is reformulated from the model proposed by Blanchard et al. (2017). With the Byzantine workers, the vectors which are actually received by the server nodes are as follows:
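Definition 1 (Classic Byzantine Model).
$$\tilde{v}_i = \begin{cases} v_i, & \text{if the $i$th worker is correct,} \\ \text{arbitrary}, & \text{otherwise.} \end{cases} \qquad (1)$$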
Note that the indices of Byzantine workers can change across different iterations. Furthermore, the server nodes are not aware of which workers are Byzantine. The only information given is the number of Byzantine workers, if necessary.
We then introduce the classic Byzantine resilience.
Definition 2.
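(Classic Byzantine Resilience). Assume that $0 \le q \le m$. Let $\{v_i : i \in [m]\}$ be any i.i.d. random vectors in $\mathbb{R}^d$, $v_i \sim G$, with $\mathbb{E}[G] = g$. Let $\{\tilde{v}_i : i \in [m]\}$ be the set of vectors, of which up to $q$ are replaced by arbitrary vectors in $\mathbb{R}^d$, while the others still equal the corresponding $v_i$. Aggregation rule $\mathrm{Aggr}(\cdot)$ is said to be classic $\Delta$-Byzantine resilient if $\mathbb{E}\|\mathrm{Aggr}(\{\tilde{v}_i : i \in [m]\}) - g\|^2 \le \Delta$, where $\Delta$ is a constant dependent on $m$ and $q$.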
The baseline algorithm Krum is defined as follows.
Definition 3.
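Krum chooses the vector with the minimal local sum of distances: $\mathrm{Krum}(\{\tilde{v}_i : i \in [m]\}) = \tilde{v}_k$ with $k = \operatorname{argmin}_{i \in [m]} \sum_{i \to j} \|\tilde{v}_i - \tilde{v}_j\|^2$, where $i \to j$ is the indices of the $m - q - 2$ nearest neighbours of $\tilde{v}_i$ in $\{\tilde{v}_i : i \in [m]\} \setminus \{\tilde{v}_i\}$ measured by Euclidean distance.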
The Krum aggregation is classic Byzantine resilient under certain assumptions. The proof is given by Proposition 1 of Blanchard et al. (2017).
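As a minimal sketch of the Krum baseline, assuming the score of Blanchard et al. (2017), namely the sum of squared Euclidean distances to the $m - q - 2$ nearest neighbours, the rule can be written as follows. The function names are hypothetical and the brute-force pairwise loop is chosen for clarity, not speed.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def krum(vectors: Sequence[Sequence[float]], q: int) -> List[float]:
    """Return the received vector with the smallest Krum score.

    q is the assumed number of Byzantine vectors; the score of a vector
    is the sum of squared distances to its m - q - 2 nearest neighbours.
    """
    m = len(vectors)
    k = m - q - 2
    if k < 1:
        raise ValueError("Krum needs m - q - 2 >= 1 nearest neighbours")
    best_index, best_score = 0, float("inf")
    for i, v in enumerate(vectors):
        dists = sorted(squared_distance(v, w)
                       for j, w in enumerate(vectors) if j != i)
        score = sum(dists[:k])
        if score < best_score:
            best_index, best_score = i, score
    return list(vectors[best_index])
```

Note that the output is always one of the received vectors, which is why a single corrupted coordinate in every vector is enough to defeat it.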
Lemma 1 (Blanchard et al. (2017)).
Let be any i.i.d. random dimensional vectors s.t. , with and . of are Byzantine. If , we have where
The generalized Byzantine model is denoted as:
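Definition 4 (Generalized Byzantine Model).
$$(\tilde{v}_i)_j = \begin{cases} (v_i)_j, & \text{if the $j$th dimension of $v_i$ is correct,} \\ \text{arbitrary}, & \text{otherwise,} \end{cases} \qquad (2)$$
where $(v_i)_j$ is the $j$th dimension of the vector $v_i$.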
Based on the Byzantine model above, we introduce a generalized Byzantine resilience property, dimensional Byzantine resilience, which is defined as follows:
Definition 5.
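(Dimensional Byzantine Resilience). Assume that $0 \le q \le m$. Let $\{v_i : i \in [m]\}$ be any i.i.d. random vectors in $\mathbb{R}^d$, $v_i \sim G$, with $\mathbb{E}[G] = g$. Let $\{\tilde{v}_i : i \in [m]\}$ be the set of candidate vectors. For each dimension, up to $q$ of the $m$ values are replaced by arbitrary values, i.e., for dimension $j \in [d]$, $q$ of $\{(\tilde{v}_i)_j : i \in [m]\}$ are Byzantine, where $(\tilde{v}_i)_j$ is the $j$th dimension of the vector $\tilde{v}_i$. Aggregation rule $\mathrm{Aggr}(\cdot)$ is said to be dimensional $\Delta$-Byzantine resilient if $\mathbb{E}\|\mathrm{Aggr}(\{\tilde{v}_i : i \in [m]\}) - g\|^2 \le \Delta$, where $\Delta$ is a constant dependent on $m$ and $q$.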
Note that classic Byzantine resilience is a special case of dimensional Byzantine resilience. For classic Byzantine resilience defined in Definition 2, all the Byzantine values must lie in the same subset of workers, as shown in Figure 1(a).
In the following propositions, we show that Mean and Krum are not dimensional Byzantine resilient ($\Delta$ is unbounded). The proofs are provided in the appendix.
Proposition 1.
The averaging aggregation rule is not dimensional Byzantine-resilient.
Proposition 2.
Any aggregation rule that outputs $\mathrm{Aggr} \in \{\tilde{v}_i : i \in [m]\}$ is not dimensional Byzantine resilient.
Krum outputs one of the received vectors, namely the one with the minimal score, so it is not dimensional Byzantine-resilient.
Proposition 3.
$\mathrm{Krum}(\cdot)$ is not dimensional Byzantine-resilient.
4 Trimmed-mean-based aggregation
With the Byzantine failure models defined in Equations (1) and (2), we propose two trimmed-mean-based aggregation rules, which are Byzantine resilient under certain conditions.
4.1 Trimmed mean
To define the trimmed mean, we first define the order statistics.
Definition 6.
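(Order Statistics) By sorting the scalar sequence $\{\tilde{u}_i : i \in [m]\}$, we get $\tilde{u}_{1:m} \le \tilde{u}_{2:m} \le \ldots \le \tilde{u}_{m:m}$, where $\tilde{u}_{k:m}$ is the $k$th smallest element in $\{\tilde{u}_i : i \in [m]\}$.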
Then, we define the trimmed mean.
Definition 7.
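(Trimmed Mean) For an integer $b$ with $0 \le b < m/2$, the $b$-trimmed mean of the set of scalars $\{\tilde{u}_i : i \in [m]\}$ is defined as follows:
$$\mathrm{Trmean}_b(\{\tilde{u}_i : i \in [m]\}) = \frac{1}{m - 2b} \sum_{k=b+1}^{m-b} \tilde{u}_{k:m},$$
where $\tilde{u}_{k:m}$ is the $k$th smallest element in $\{\tilde{u}_i : i \in [m]\}$ defined in Definition 6. The high-dimensional version, $\mathrm{Trmean}_b(\{\tilde{v}_i : i \in [m]\})$, simply applies $\mathrm{Trmean}_b(\cdot)$ in the coordinate-wise manner.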
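The coordinate-wise trimmed mean can be sketched as below. This is a minimal illustration under the assumption that `b` values are dropped at each end of the sorted sequence; the names `trimmed_mean` and `trmean` are hypothetical, and sorting is used for simplicity instead of a linear-time selection algorithm.

```python
from typing import List, Sequence


def trimmed_mean(values: Sequence[float], b: int) -> float:
    """Average the values left after dropping the b smallest and b largest."""
    m = len(values)
    if b < 0 or 2 * b >= m:
        raise ValueError("b must satisfy 0 <= b < m / 2")
    kept = sorted(values)[b:m - b]
    return sum(kept) / len(kept)


def trmean(vectors: Sequence[Sequence[float]], b: int) -> List[float]:
    """Apply the scalar trimmed mean independently to every coordinate."""
    return [trimmed_mean(column, b) for column in zip(*vectors)]


# One huge outlier among five scalars is discarded when b = 1.
robust = trimmed_mean([1.0, 2.0, 3.0, 4.0, 1e20], 1)
```

Because every coordinate is treated separately, the corrupted values may sit in a different worker for each dimension without affecting the rule.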
The following theorem claims that by using $\mathrm{Trmean}_b(\cdot)$, the resulting vector is dimensional Byzantine resilient. A proof is provided in the appendix.
Theorem 1.
(Bounded Variance) Let be any i.i.d. random dimensional vectors s.t. , with and . In each dimension, values are Byzantine, which yields . If , we have where
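Theorem 1 tells us that the upper bound of the variance decreases when the number of workers $m$ increases, or when the trimming parameter $b$, the number of Byzantine values $q$, or the variance bound $V$ of the correct gradients decreases.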
4.2 Beyond trimmed mean
Using the trimmed mean, we have to drop $2b$ elements for each dimension. In this section, we explore the possibility of aggregating more elements. To be more specific, for each dimension, we take the average of the $m - b$ values nearest to the trimmed mean. We call the resulting aggregation rule Phocas (the name of a Byzantine emperor), which is defined as follows:
Definition 8.
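(Phocas) We sort the scalar sequence $\{\tilde{u}_i : i \in [m]\}$ by using the distance to a certain value $y$: $|\tilde{u}_{1/y} - y| \le |\tilde{u}_{2/y} - y| \le \ldots \le |\tilde{u}_{m/y} - y|$, where $\tilde{u}_{k/y}$ is the $k$th nearest element to $y$ in $\{\tilde{u}_i : i \in [m]\}$. Phocas is the average of the first $m - b$ nearest elements to the trimmed mean $\mathrm{Trmean}_b = \mathrm{Trmean}_b(\{\tilde{u}_i : i \in [m]\})$:
$$\mathrm{Phocas}_b(\{\tilde{u}_i : i \in [m]\}) = \frac{1}{m - b} \sum_{k=1}^{m-b} \tilde{u}_{k/\mathrm{Trmean}_b}.$$
The high-dimensional version, $\mathrm{Phocas}_b(\{\tilde{v}_i : i \in [m]\})$, simply applies $\mathrm{Phocas}_b(\cdot)$ in the coordinate-wise manner.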
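A minimal sketch of Phocas follows. It assumes that the number of values kept per dimension is $m - b$, which is our reading of the definition; the count should be checked against the original formula. The function names are hypothetical, and the trimmed mean is recomputed here so that the block is self-contained.

```python
from typing import List, Sequence


def trimmed_mean(values: Sequence[float], b: int) -> float:
    """Average the values left after dropping the b smallest and b largest."""
    m = len(values)
    if b < 0 or 2 * b >= m:
        raise ValueError("b must satisfy 0 <= b < m / 2")
    kept = sorted(values)[b:m - b]
    return sum(kept) / len(kept)


def phocas_1d(values: Sequence[float], b: int) -> float:
    """Average the values nearest to the trimmed mean.

    Assumption: m - b values are kept (b values farthest from the
    trimmed mean are discarded).
    """
    center = trimmed_mean(values, b)
    keep = len(values) - b
    nearest = sorted(values, key=lambda u: abs(u - center))[:keep]
    return sum(nearest) / len(nearest)


def phocas(vectors: Sequence[Sequence[float]], b: int) -> List[float]:
    """Apply the scalar rule independently to every coordinate."""
    return [phocas_1d(column, b) for column in zip(*vectors)]


# The trimmed mean of this sequence is 3.0; the outlier 1e20 is the
# single value farthest from it and is therefore excluded.
value = phocas_1d([1.0, 2.0, 3.0, 4.0, 1e20], 1)
```

Compared with the trimmed mean, this rule averages over more of the received values, at the cost of one extra pass to rank the values by distance.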
We show that $\mathrm{Phocas}_b(\cdot)$ is dimensional Byzantine-resilient.
Theorem 2.
(Bounded Variance) Let be any i.i.d. random dimensional vectors s.t. , with and . In each dimension, values are Byzantine, which yields If , we have where
The Phocas aggregation can be viewed as a trimmed average centering at the trimmed mean, which filters out the values far away from the trimmed mean. Similar to the trimmed mean, the variance of Phocas decreases when the number of workers $m$ increases, or when the trimming parameter $b$, the number of Byzantine values $q$, or the variance bound $V$ of the correct gradients decreases.
4.3 Convergence analysis
In this section, we provide the convergence guarantees for synchronous SGD with Byzantine-resilient aggregation rules. The proofs can be found in the appendix. We first introduce the two conditions necessary in our convergence analysis.
Definition 9.
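If $F(x)$ is $L$-smooth, then $F(y) \le F(x) + \langle \nabla F(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$ for all $x, y$, where $L > 0$. If $F(x)$ is $\mu$-strongly convex, then $F(y) \ge F(x) + \langle \nabla F(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$ for all $x, y$, where $\mu > 0$.
First, we prove that for strongly convex and smooth loss functions, SGD with Byzantine-resilient aggregation rules has linear convergence with a constant error.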
Theorem 3.
Assume that is strongly convex and smooth, where . We take . In any iteration , the correct gradients are . Using any (classic or dimensional) Byzantine-resilient aggregation rule with corresponding assumptions, we obtain linear convergence with a constant error after iterations with synchronous SGD:
Then, we prove the convergence of SGD for general smooth loss functions.
Theorem 4.
Assume that is smooth and potentially nonconvex, where . We take . In any iteration , the correct gradients are . Using any (classic or dimensional) Byzantine-resilient aggregation rule with corresponding assumptions, we obtain linear convergence with a constant error after iterations with synchronous SGD:
4.4 Time complexity
For the trimmed mean, we only need to find the order statistics of each dimension. To do so, we use the so-called selection algorithm (Blum et al., 1973), which finds the $k$th smallest element in linear time. In general, the time complexity is $O((m - 2b) d m)$. When $b$ is large, the factor $m - 2b$ can be ignored, which yields the nearly linear time complexity $O(dm)$. When $b$ is small, the time complexity is the same as that of the sorting algorithm, which is $O(dm \log m)$. For Phocas, the computation additional to computing the trimmed mean takes linear time $O(dm)$. Thus, the time complexity is the same as Trmean. Note that for Krum and Multi-Krum, the time complexity is $O(d m^2)$ (Blanchard et al., 2017).
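Table 1: Datasets and default hyperparameters.

| Dataset | # train | # test | Learning rate | # rounds | Batch size | Evaluation metric |
|---|---|---|---|---|---|---|
| MNIST (Loosli et al., 2007) | 60k | 10k | 0.1 | 500 | 32 | top-1 accuracy |
| CIFAR-10 (Krizhevsky and Hinton, 2009) | 50k | 10k | 5e-4 | 4000 | 128 | top-3 accuracy |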
5 Experiments
In this section, we evaluate the Byzantine resilience properties of the proposed algorithms. We consider two image classification tasks: handwritten digit classification on the MNIST dataset using a multilayer perceptron (MLP), and object recognition on the CIFAR-10 dataset using a convolutional neural network (CNN). The details of these two neural networks can be found in the appendix. There are $m = 20$ worker processes. We repeat each experiment ten times and take the average. The details of the datasets and the default hyperparameters of the corresponding models are listed in Table 1. We use top-1 or top-3 accuracy on the testing sets (disjoint from the training sets) as evaluation metrics.
The baseline aggregation rules are Mean, Krum (Definition 3), and Multi-Krum. Multi-Krum is a variant of Krum defined in Blanchard et al. (2017), which takes the average of several vectors selected by multiple rounds of Krum. We also include averaging without Byzantine failures as a baseline, which is referred to as Mean without Byzantine. We compare these baseline algorithms with the proposed algorithms, Trmean (Definition 7) and Phocas (Definition 8), under different attacks.
Note that all the experiments with the CNN on CIFAR-10 show results similar to the experiments with the MLP on MNIST. Thus, we put the results of the CNN in the appendix.
5.1 Byzantine resilience
In this section, we test the Byzantine resilience of the proposed algorithms under different kinds of attacks. The zoomed figure of each experiment can be found in the appendix.
5.1.1 Gaussian attack
We test classic Byzantine resilience in this experiment. We consider attackers that replace some of the gradient vectors with Gaussian random vectors with zero mean and isotropic covariance matrix with standard deviation 200. We refer to this kind of attack as the Gaussian attack. 6 out of the 20 gradient vectors are Byzantine. The results are shown in Figure 2(a). As expected, averaging is not Byzantine resilient. The gaps between all the other algorithms are tiny. Phocas performs as if there were no Byzantine failures at all. Krum, Multi-Krum, and Trmean converge slightly slower.
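The Gaussian attack can be simulated as in the following sketch. It assumes that the Byzantine workers are chosen uniformly at random, which the text does not specify; the function name and the `rng` parameter are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def gaussian_attack(grads: Sequence[Sequence[float]], num_byzantine: int,
                    std: float = 200.0,
                    rng: Optional[random.Random] = None) -> List[List[float]]:
    """Replace num_byzantine whole gradient vectors with N(0, std^2) noise.

    Assumption: the attacked workers are picked uniformly at random.
    The input is not modified; a corrupted copy is returned.
    """
    rng = rng or random.Random()
    m, d = len(grads), len(grads[0])
    attacked = [list(g) for g in grads]
    for i in rng.sample(range(m), num_byzantine):
        attacked[i] = [rng.gauss(0.0, std) for _ in range(d)]
    return attacked


# 20 workers, 6 of them Byzantine, as in the experiment described above.
clean = [[1.0] * 5 for _ in range(20)]
corrupted = gaussian_attack(clean, 6, rng=random.Random(0))
```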
5.1.2 Omniscient attack
We test classic Byzantine resilience in this experiment. This kind of attacker is assumed to know all the correct gradients. For each Byzantine gradient vector, the gradient is replaced by the negative sum of all the correct gradients, scaled by a large constant (1e20 in the experiments). Roughly speaking, this attack tries to make the parameter server take a long step in the opposite direction. 6 out of the 20 gradient vectors are Byzantine. The results are shown in Figure 2(b). Phocas still performs as if there were no failure. Multi-Krum is not as good as Phocas, but the gap is small. Krum converges slower. However, Trmean converges to bad solutions.
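A minimal sketch of the omniscient attack, assuming the set of Byzantine worker indices is given explicitly; the function name is hypothetical.

```python
from typing import Iterable, List, Sequence


def omniscient_attack(grads: Sequence[Sequence[float]],
                      byzantine_idx: Iterable[int],
                      scale: float = 1e20) -> List[List[float]]:
    """Replace each Byzantine vector by -scale * (sum of correct vectors)."""
    byz = set(byzantine_idx)
    d = len(grads[0])
    correct_sum = [sum(g[k] for i, g in enumerate(grads) if i not in byz)
                   for k in range(d)]
    malicious = [-scale * s for s in correct_sum]
    return [list(malicious) if i in byz else list(g)
            for i, g in enumerate(grads)]


# Small example with scale = 10 so that the numbers stay readable.
attacked = omniscient_attack([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [2],
                             scale=10.0)
```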
5.1.3 Bit-flip attack
We test dimensional Byzantine resilience in this experiment. Knowing the information of other workers can be difficult in practice. Thus, we use a more realistic scenario in this experiment. The attacker only manipulates some individual floating-point numbers by flipping the 22nd, 30th, 31st, and 32nd bits. For each of the first 1000 dimensions, 1 of the 20 floating-point numbers is manipulated using the bit-flip attack. The results are shown in Figure 2(c). As expected, only Phocas and Trmean are dimensional Byzantine resilient.
Note that for Krum and Multi-Krum, their assumption requires the number of Byzantine vectors $q$ to satisfy $2q + 2 < m$, which means $q \le 8$ in our experiments. However, because each gradient is partially manipulated, all the vectors are Byzantine, which breaks the assumption of the Krum-based algorithms. Furthermore, to compute the distances to the $m - q - 2$ nearest neighbours, $m - q - 2$ must be positive. To test the performance of Krum and Multi-Krum, we set $q$ to a value that satisfies this requirement for these two algorithms so that they can still be executed. Furthermore, we test whether tuning $q$ can make a difference. The results are shown in Figure 3(a). Whatever $q$ we use, Krum-based algorithms get stuck around bad solutions.
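The bit-flip attack can be sketched with the standard `struct` module. Several details below are assumptions, since the text does not state them: values are treated as IEEE 754 single-precision floats, bit positions are counted from 1 at the least significant bit (so the 32nd bit is the sign bit), and the attacked worker is drawn uniformly at random for each dimension. The function names are hypothetical.

```python
import random
import struct
from typing import List, Optional, Sequence

# Assumption: 1-indexed bit positions, counted from the least significant bit.
FLIPPED_BITS = (22, 30, 31, 32)


def flip_bits(value: float, bits: Sequence[int] = FLIPPED_BITS) -> float:
    """Flip the given bits of the float32 representation of value."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    for bit in bits:
        raw ^= 1 << (bit - 1)
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw))
    return flipped


def bitflip_attack(grads: Sequence[Sequence[float]], num_dims: int = 1000,
                   rng: Optional[random.Random] = None) -> List[List[float]]:
    """Corrupt one value per dimension in the first num_dims dimensions.

    Assumption: the attacked worker is chosen at random per dimension.
    """
    rng = rng or random.Random()
    attacked = [list(g) for g in grads]
    m, d = len(attacked), len(attacked[0])
    for k in range(min(num_dims, d)):
        i = rng.randrange(m)
        attacked[i][k] = flip_bits(attacked[i][k])
    return attacked


# Under the assumptions above, 1.0 turns into a huge negative number.
example = flip_bits(1.0)
```

Depending on the input, the flipped pattern can also encode infinity or NaN, so an aggregation rule under test should be prepared for non-finite values.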
5.1.4 General attack with multiple servers
We test general Byzantine resilience in this experiment. We evaluate the robust aggregation rules under a more general and realistic type of attack. It is very popular to partition the parameters into disjoint subsets, and use multiple server nodes to store and aggregate them (Li et al., 2014a, b; Ho et al., 2013). We assume that the parameters are evenly partitioned and assigned to the server nodes. The attacker picks one server, and manipulates any floating-point number by multiplying , with probability of . We call this attack gambler, because the attacker randomly manipulates the values, with the goal that in some iterations the assumptions/prerequisites of the robust aggregation rules are broken, which crashes the training. Such an attack requires less global information, and can be concentrated on one single server, which makes it more realistic and easier to implement.
In Figure 2(d), we evaluate the performance of all the robust aggregation rules under the gambler attack. The number of servers is . For Krum, Multi-Krum and Phocas, we set the estimated Byzantine number to 8. Only Phocas and Trmean survive under this attack. The convergence is slightly slower than the averaging without Byzantine values, but the gaps are small.
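The gambler attack can be sketched as follows. The multiplier, the corruption probability, and the number of servers are not recoverable from the text above, so they are left as required arguments (`factor`, `prob`, `num_servers`) and the values in the example are placeholders only. The contiguous partition of the parameters across servers is also an assumption, and the function name is hypothetical.

```python
import random
from typing import List, Optional, Sequence


def gambler_attack(grads: Sequence[Sequence[float]], num_servers: int,
                   target_server: int, factor: float, prob: float,
                   rng: Optional[random.Random] = None) -> List[List[float]]:
    """Randomly scale values that are routed to one attacked server.

    Assumption: parameters are split into contiguous, (almost) equal
    blocks, one block per server. Each value in the target block is
    multiplied by `factor` independently with probability `prob`.
    """
    rng = rng or random.Random()
    d = len(grads[0])
    block = -(-d // num_servers)  # ceiling division
    lo = target_server * block
    hi = min(d, lo + block)
    attacked = [list(g) for g in grads]
    for row in attacked:
        for k in range(lo, hi):
            if rng.random() < prob:
                row[k] *= factor
    return attacked


# Placeholder values: 4 servers, server 1 attacked, every value corrupted.
out = gambler_attack([[1.0] * 8 for _ in range(4)], num_servers=4,
                     target_server=1, factor=-100.0, prob=1.0)
```

With a small `prob`, most iterations contain few corrupted values per dimension, but occasionally the per-dimension count may exceed what a given rule tolerates, which is the "gamble" described above.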
5.1.5 Sensitivity to the hyperparameters
We test the robustness to the estimated number of Byzantine workers $b$ ($q$ for Krum and Multi-Krum) in this experiment. We show the maximal accuracy throughout the training. The results are shown in Figure 3(b). The performance of Phocas and Trmean does not significantly change when $b$ varies.
5.2 Discussion
As expected, Mean aggregation is not Byzantine resilient. Krum and Multi-Krum are classic Byzantine-resilient but not dimensional Byzantine-resilient. Phocas and Trmean are dimensional Byzantine resilient. However, under the omniscient attack, Trmean suffers from larger variances, which slow down the convergence.
The gambler attack shows the true advantage of dimensional Byzantine resilience: higher probability of survival. Under such an attack, the assumptions/prerequisites of Phocas and Trmean may still fail. However, their probability of crashing is lower than that of the other algorithms because dimensional Byzantine resilience generalizes classic Byzantine resilience. An interesting observation is that Trmean is slightly better than Phocas under the gambler attack. That is because the estimate of the number of Byzantine values is not accurate, which causes some unpredictable behavior for Phocas. We choose 8 because it is the maximal value we can take for Krum and Multi-Krum, which require $2q + 2 < m$ with $m = 20$.
Phocas performs best in almost all cases. Multi-Krum is also good, except that it is not dimensional Byzantine-resilient. The reason why Phocas and Multi-Krum have better performance is that they aggregate more candidates to stabilize the convergence. Note that Phocas not only performs as well as or even better than Multi-Krum, but also has lower time complexity.
Trmean has the cheapest computation. Its worst case, the omniscient attack, is hard to implement in reality. Thus, for most applications, we suggest Trmean as an easy-to-implement aggregation rule with robust performance. However, if we assume the worst cases of the attacks/failures, Phocas should be adopted for best robustness.
6 Related work
Our work is closely related to Blanchard et al. (2017) and Yin et al. (2018). Another paper Chen et al. (2017) proposed grouped geometric median for Byzantine resilience, with strongly convex functions.
Our approach offers the following important advantages over the previous work.

Cheaper computation compared to Krum. Trmean and Phocas have nearly linear time complexity, while the time complexity of Krum is $O(d m^2)$.

Simpler dimension-free convergence guarantees with fewer assumptions. Yin et al. (2018) also study the Byzantine resilience of the trimmed mean and its special case, the median. However, in that work, the bounds grow with the number of dimensions $d$, even if the variance of gradients is fixed. To establish the bounds, the authors assume a bounded domain, and sub-exponential gradients with bounded skewness, which are not required in our theoretical analysis. In this paper, we use fewer assumptions to prove the dimension-free theoretical guarantees of the trimmed mean.
The major contribution of this paper is a combination of theory and practice. First, we provide the theoretical guarantee of the convergence of the trimmed mean with fewer assumptions. Then, we propose a novel aggregation rule, Phocas, which has comparable theoretical guarantees, and comparable or even better performance in the experiments.
7 Conclusion
We investigate generalized Byzantine resilience, and propose trimmed-mean-based aggregation rules for synchronous SGD. The algorithms have low time complexity and provable convergence. Our empirical results show good performance. In future work, we will study Byzantine resilience in other scenarios such as asynchronous training.
References
 Alistarh et al. (2018) D. Alistarh, Z. Allen-Zhu, and J. Li. Byzantine stochastic gradient descent. arXiv preprint arXiv:1803.08917, 2018.
 Blanchard et al. (2017) P. Blanchard, R. Guerraoui, J. Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 118–128, 2017.
 Blum et al. (1973) M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of computer and system sciences, 7(4):448–461, 1973.
 Bubeck et al. (2015) S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
 Chen et al. (2017) Y. Chen, L. Su, and J. Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491, 2017.
 Dean et al. (2012) J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
 Harinath et al. (2017) D. Harinath, P. Satyanarayana, and M. R. Murthy. A review on security issues and attacks in distributed systems. Journal of Advances in Information Technology, 8(1), 2017.
 Ho et al. (2013) Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. R. Ganger, and E. P. Xing. More effective distributed ml via a stale synchronous parallel parameter server. Advances in neural information processing systems, 2013:1223–1231, 2013.
 Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 Lamport et al. (1982) L. Lamport, R. E. Shostak, and M. C. Pease. The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4:382–401, 1982.
 Lee et al. (2017) J. Lee, D. Hwang, J. Park, and K.-H. Kim. Risk analysis and countermeasure for bit-flipping attack in LoRaWAN. In Information Networking (ICOIN), 2017 International Conference on, pages 549–551. IEEE, 2017.
 Li et al. (2014a) M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014a.
 Li et al. (2014b) M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In NIPS, 2014b.
 Loosli et al. (2007) G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. Large scale kernel machines, pages 301–320, 2007.
 McMahan et al. (2017) H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
 Mukkamala and Hein (2017) M. C. Mukkamala and M. Hein. Variants of rmsprop and adagrad with logarithmic regret bounds. In ICML, 2017.
 Yin et al. (2018) D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.
8 Appendix
In the appendix, we introduce several useful lemmas and use them to derive the detailed proofs of the theorems in this paper.
8.1 Dimensional Byzantine resilience
Lemma 1 (Blanchard et al. [2017]).
Let be any i.i.d. random dimensional vectors s.t. , with and . of are Byzantine. If , we have where
Proof.
We denote the correct values as , and . Using Blanchard et al. [2017] Proposition 1, we have
where is the set of correct elements in the nearest neighbours to in , measured by Euclidean distance. Thus, we obtain
∎
Proposition 1.
Averaging is not dimensional Byzantine resilient.
Proof.
We demonstrate a counterexample. Consider the case where
(3) 
where , . Thus, the resulting aggregation is . The inner product is always negative under the Byzantine attack. Thus, SGD does not descend in expectation, which means that it will not converge to critical points. Note that in this counterexample, the number of Byzantine values of each dimension is .
Hence, averaging is not dimensional Byzantine-resilient. ∎
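The effect can be illustrated numerically. The sketch below is our own toy example in the same spirit, not the counterexample of Equation (3): each dimension contains exactly one Byzantine value, and the two Byzantine values sit in different workers. All names and numbers are hypothetical.

```python
from typing import List, Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def trimmed_mean(values: Sequence[float], b: int) -> float:
    """Average after dropping the b smallest and b largest values."""
    kept = sorted(values)[b:len(values) - b]
    return sum(kept) / len(kept)


def inner(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


true_gradient = [1.0, 1.0]
# One Byzantine value (-1e6) per dimension, located in different workers.
received: List[List[float]] = [
    [-1e6, 1.0],
    [1.0, -1e6],
    [0.9, 1.1],
    [1.1, 0.9],
    [1.0, 1.0],
]
avg = [mean(column) for column in zip(*received)]
robust = [trimmed_mean(column, 1) for column in zip(*received)]

# Averaging points away from the true gradient; the trimmed mean does not.
avg_alignment = inner(true_gradient, avg)
robust_alignment = inner(true_gradient, robust)
```

Here `avg_alignment` is negative while `robust_alignment` is positive, so a step along the averaged direction moves against the true gradient.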
Proposition 2.
Any aggregation rule that outputs $\mathrm{Aggr} \in \{\tilde{v}_i : i \in [m]\}$ is not dimensional Byzantine resilient.
Proof.
We demonstrate a counterexample. Consider the case where the $i$th dimension of the $i$th vector, $(v_i)_i$, is manipulated by the malicious workers (e.g., multiplied by an arbitrarily large negative value), where $i \in [m]$. Thus, up to 1 value of each dimension is Byzantine. However, no matter which vector is chosen, as long as the aggregation is chosen from $\{\tilde{v}_i : i \in [m]\}$, the inner product between the true gradient and the aggregation can be an arbitrarily large negative value under the Byzantine attack. Thus, SGD does not descend in expectation, which means that it will not converge to critical points.
Hence, any aggregation rule that outputs $\mathrm{Aggr} \in \{\tilde{v}_i : i \in [m]\}$ is not dimensional Byzantine-resilient. ∎
8.2 Trimmed mean
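We use the following lemma to bound the one-dimensional trimmed mean.
Lemma 2.
Assume that among the scalar sequence $\{\tilde{u}_i : i \in [m]\}$, $q$ elements are Byzantine. Without loss of generality, we denote the remaining correct values as $\{u_1, \ldots, u_{m-q}\}$. Thus, $\tilde{u}_{(i+q):m} \ge u_{i:(m-q)}$ for $i \in [m-q]$, and $\tilde{u}_{i:m} \le u_{i:(m-q)}$ for $i \in [m-q]$, where $\tilde{u}_{k:m}$ is the $k$th smallest element in $\{\tilde{u}_i : i \in [m]\}$, and $u_{k:(m-q)}$ is the $k$th smallest element in $\{u_1, \ldots, u_{m-q}\}$.
Proof.
We prove the two inequalities separately.
(i) We prove the first inequality by contradiction.
If $\tilde{u}_{(i+q):m} < u_{i:(m-q)}$, then there will be at least $m - q - i + 1$ correct values larger than $\tilde{u}_{(i+q):m}$. However, because $\tilde{u}_{(i+q):m}$ is the $(i+q)$th smallest element in the sequence $\{\tilde{u}_i : i \in [m]\}$, there are at most $m - i - q$ elements larger than $\tilde{u}_{(i+q):m}$, which yields a contradiction.
(ii) We prove the second inequality by contradiction.
If $\tilde{u}_{i:m} > u_{i:(m-q)}$, then there will be at least $i$ correct values smaller than $\tilde{u}_{i:m}$. However, because $\tilde{u}_{i:m}$ is the $i$th smallest element in the sequence $\{\tilde{u}_i : i \in [m]\}$, there are at most $i - 1$ elements smaller than $\tilde{u}_{i:m}$, which yields a contradiction.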
∎
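The interlacing property used in Lemma 2 can be checked numerically. The sketch below assumes our reading of the lemma, namely that the $i$th smallest correct value lies between the $i$th and the $(i+q)$th smallest received values; the function name is hypothetical, and a random test is of course not a proof.

```python
import random
from typing import Sequence


def lemma2_holds(correct: Sequence[float], byzantine: Sequence[float]) -> bool:
    """Check ut[i] <= u[i] <= ut[i + q] for every correct order statistic.

    u  : sorted correct values (m - q of them)
    ut : sorted received values (correct plus q Byzantine values)
    """
    q = len(byzantine)
    u = sorted(correct)
    ut = sorted(list(correct) + list(byzantine))
    return all(ut[i] <= u[i] <= ut[i + q] for i in range(len(u)))


# Randomized check with arbitrary (including extreme) Byzantine values.
rng = random.Random(0)
all_ok = True
for _ in range(200):
    correct = [rng.uniform(-1.0, 1.0) for _ in range(rng.randint(1, 10))]
    byzantine = [rng.choice([-1e9, 1e9, rng.uniform(-1.0, 1.0)])
                 for _ in range(rng.randint(0, 5))]
    all_ok = all_ok and lemma2_holds(correct, byzantine)
```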
Theorem 1.
(Bounded Variance) Let be any i.i.d. random dimensional vectors s.t. , with and . In each dimension, values are Byzantine, which yields . If , we have where
Proof.
We first assume that all the ’s, ’s, and are scalars, with the variance . Using Lemma 2, we have
Thus, we have
Note that for arbitrary subset , , we have the following bound:
By taking the expectations, we obtain
Combining all the ingredients above, we obtain the desired result:
Then, we generalize ’s, ’s, and to dimensional vectors with the variance , where , . For , we have
Thus, we have
∎
8.3 Phocas
The following lemma bounds the one-dimensional Phocas.
Lemma 3.
For the scalar sequence , and the corresponding trimmed mean , we have
where , is the sequence of correct values in .
Proof.
We prove this lemma by contradiction.
Assume that . Thus, there are correct values closer than to . However, according to the definition of , it is the th closest value to , which means that there are at most values closer than to , which yields a contradiction. ∎
Theorem 2.
(Bounded Variance) Let be any i.i.d. random dimensional vectors s.t. , with and . In each dimension, values are Byzantine, which yields If , we have where
Proof.
We first assume that all the ’s, ’s, and are scalars, with the variance . For convenience, we denote . Thus, we have
Using Theorem 1, we already have
We only need to bound as follows:
Taking expectation on both sides, we have