# Fast and Secure Distributed Learning in High Dimension^{1}^{1}1A practical implementation based on this work is described in [3], the code is available in the following Github repository: https://github.com/LPD-EPFL/AggregaThor ^{2}^{2}2Work in progress.

###### Abstract

Modern machine learning is distributed and the work of several machines is typically aggregated by averaging which is the optimal rule in terms of speed, offering a speedup of (with respect to using a single machine) when processes are learning together.

Distributing data and models poses however fundamental vulnerabilities, be they to software bugs, asynchrony, or worse, to malicious attackers controlling some machines or injecting misleading data in the network. Such behavior is best modeled as Byzantine failures, and averaging does not tolerate a single one from a worker.

Krum, the first provably Byzantine resilient aggregation rule for SGD only uses one worker per step, which hampers its speed of convergence, especially in best case conditions when none of the workers is actually Byzantine. An idea, coined multi-Krum, of using different workers per step was mentioned, without however any proof neither on its Byzantine resilience nor on its slowdown. More recently, it was shown that in high dimensional machine learning, guaranteeing convergence is not a sufficient condition for strong Byzantine resilience. A improvement on Krum, coined Bulyan, was proposed and proved to guarantee stronger resilience. However, Bulyan suffers from the same weakness of Krum: using only one worker per step. This adds up to the aforementioned open problem and leaves the crucial need for both fast and strong Byzantine resilience unfulfilled.

The present paper tackles both open problems and proposes using Bulyan over Multi-Krum (we call it multi-Bulyan), a combination for which we provide proofs of strong Byzantine resilience, as well as an slowdown, compared to averaging, the fastest (but non Byzantine resilient) rule for distributed machine learning.

Finally, modern machine learning involves data of unprecedentedly high dimension: some models are nowadays vectors of dimension . In order to deliver results within a reasonable time, learning algorithms should be at most linear in and avoid using classic security mechanisms, most of which are at least quadratic in and hence impractical. A strength of multi-Bulyan is that it inherits the merits of both multi-Krum and Bulyan.

algorithm

## 1 Introduction

The ongoing data deluge has been both a blessing and burden for machine learning system designers. A blessing since machine learning provably performs better with more training datan [10], and a burden since the numbers are beyond previous orders of magnitude. For instance, machine learning set of parameters are now in the gigabyte [5], training data is several orders of magnitude beyond that [5]. For the latter reason, distributed machine learning is not an option, it the only way to deliver results in a reasonable time for the user. For instance, Stochastic Gradient Descent (SGD), an algorithm which is the workhorse of today’s machine learning.

With the amounts of workload involved in today’s machine learning, distributed systems are the only option to deliver results in a reasonable time.

This constraint is even more crucial since ML, given its workload, relies on large scale distributed systems for which communications costs are additional constraints to local computation costs.

We prove the similar slowdown of multi-Bulyan and its (strong) Byzantine resilience. We deduce that multi-Bulyan ensures strong Byzantine resilience and the very fact that it is times as fast as the optimal algorithm (averaging) in the absence of Byzantine workers.

multi-Bulyan can be viewed as generalization (also using different workers per step to leverage the fact that , possibly less than a minority can be faulty) of Bulyan, the defense mechanism we present in [6]. Before presenting in Section 3, our proofs of convergence and slow down of multi-Krum and in Section 4 our proofs of convergence and slow down of Bulyan and hence multi-Bulyan, we introduce in Section 2 a toolbox of formal definitions: weak, strong, and –Byzantine resilience. We also present a necessary context on non-convex optimization, as well as its interplay with the high dimensionality of machine learning together with the leeway it provides to strong attackers

## 2 Model

### 2.1 Stochastic Gradient Descent

The learning task consists in making accurate predictions for the labels of each data instance using a high dimentional model (for example, a neural network); we denote the parameters of that model by the vector . Each data instance has a set of features (image pixels), and a set of labels (e.g., {cat, person}). The CNN is trained with the popular backpropagation algorithm based on SGD. Specifically, SGD addresses the following optimization problem.

(1) |

where is a random variable representing a total of data instances and is the loss function. The function is smooth but not convex.

SGD computes the gradient () and then updates the model parameters () in a direction opposite to that of the gradient (descent).
The vanilla SGD update rule given a sequence of learning rates at any given step^{3}^{3}3A step denotes an update in the model parameters. is the following:

(2) |

The popularity of SGD stems from its ability to employ noisy approximations of the actual gradient. In a distributed setup, SGD employs a mini-batch of training instances for the gradient computation:

(3) |

The size of the mini-batch () induces a trade-off between the robustness of a given update (noise in the gradient approximation) and the time required to compute this update. The mini-batch also affects the amount of parallelism (Equation 3) that modern computing clusters (multi-GPU etc.) largely benefit from. Scaling the mini-batch size to exploit additional parallelism requires however a non-trivial selection of the sequence of learning rates [7]. A very important assumption for the convergence properties of SGD is that each gradient is an unbiased estimation of the actual gradient, which is typically ensured through uniform random sampling, i.e., gradients that are on expectation equal to the actual gradient (Figure 1).

### 2.2 Algorithms

multi-Bulyan relies on two algorithmic components: multi-Krum [1] and Bulyan [6]. The former rule requires that and the second requires that .

Intuitively, the goal of multi-Krum is to select the gradients that deviate less from the “majority” based on their relative distances. Given gradients proposed by workers 1 to respectively, multi-Krum selects the gradients with the smallest sum of scores (i.e., L2 norm from the other gradients) as follows:

(4) |

where given a function , denotes the indexes with the smallest values, and means that is among the closest gradients to . Bulyan in turn takes the aforementioned vectors, computes their coordinate-wise median and produces a gradient which coordinates are the average of the closest values to the median.

### 2.3 Byzantine Resilience

Intuitively, weak Byzantine resilience requires a to guarantee convergence despite the presence of Byzantine workers. It can be formally stated as follows.

###### Definition 1 (Weak Byzantine resilience).

We say that a ensures weak -Byzantine resilience if the sequence (Equation 2 in the main paper) converges almost surely to some where , despite the presence of Byzantine workers.

On the other hand, strong Byzantine resilience requires that this convergence does not lead to ”bad” optimums, and is related to more intricate problem of non-convex optimization, which, in the presence of Byzantine workers, is highly aggravated by the dimension of the problem as explained in what follows.

#### Specificity of non-convex optimization.

Non-convex optimization is one of the earliest established NP-hard problems [8]. In fact, many interesting but hard questions in machine learning boil down to one answer: ”because the cost function is not necessarily convex”.

In distributed machine learning, the non-convexity of the cost function creates two non-intuitive behaviours that are important to highlight.

(1) A ”mild” Byzantine worker can make the system converge faster. For instance, it has been reported several times in the literature that noise accelerates learning [8, 2]. This can be understood from the ”S” (stochasticity) of SGD: as (correct) workers cannot have a full picture of the surrounding landscape of the loss, they can only draw a sample at random and estimate the best direction based on that sample, which can be, and is probably biased compared to the true gradient. Moreover, due to non-convexity, even the true gradient might be leading to the local minima where the parameter server is. By providing a wrong direction (i.e. not the true gradient, or a correct stochastic estimation), a Byzantine worker whose resources cannot face the high-dimensional landscape of the loss, might end up providing a direction to get out of that local minima.

(2) Combined with high dimensional issues, non-convexity explains the need for strong Byzantine resilience. Unlike the ”mild” Byzantine worker, a strong adversary with more resources than the workers and the server, can see a larger picture and provide an attack that requires a stronger requirement. Namely, a requirement that would cut the leeway offered to an attacker in each dimension. Figure 2 provides an illustration.

This motivates the following formalization of strong Byzantine resilience.

###### Definition 2 (Strong Byzantine resilience).

We say that a ensures strong -Byzantine resilient if for every , there exists a correct gradient (i.e., computed by a non-Byzantine worker) s.t. . The the expectation is taken over the random samples ( in Equation 1)and denotes the coordinate of a vector .

#### Weak vs. strong Byzantine resilience.

To attack non-Byzantine resilient s such as averaging, it only takes the computation of an estimate of the gradient, which can be done in operations per round by a Byzantine worker. This attack is reasonably cheap: within the usual cost of the workload of other workers, , and the server, .

To attack weakly Byzantine-resilient s however, one needs to find the ’most legitimate but harmful vector possible’, i.e one that will (1) be selected by a weakly Byzantine-resilient , and (2) be misleading convergence (red arrow in Figure 1). To find this vector, an attacker has to first collect every correct worker’s vector (before they reach the server), and solve an optimization problem (by linear regression) to approximate this harmful but legitimate vector [6]. If the desired quality of the approximation is , the Byzantine worker would need at least operation to reach it with regression. This is a tight lower bound for a regression problem in dimensions with vectors [8]. In practice, if the required precision is of order , with workers and a neural network model of dimension , the cost of the attack becomes quickly prohibitive ( operations to be done in each step by the attacker).

To summarize, weak Byzantine resilience can be enough as a practical solution against attackers whose resources are comparable to the server’s. However, strong Byzantine resilience remains the only provable solution against attackers with significant resources.

For the sake of our theoretical analysis, we also recall the definition of –Byzantine resilience [1] (Definition 3). This definition is a sufficient condition (as proved in [1] based on [2]) for weak Byzantine resilience.Even-though the property of –Byzantine resilience is a sufficient, but not a necessary condition for (weak) Byzantine resilience, it has been so far used as the defacto standard [11, 1, 4] to guarantee (weak) Byzantine resilience for SGD. We will therefore follow this standard and require –Byzantine resilience from any that is plugged into multi-Bulyan, in particular, we will require it from multi-Krum. The theoretical analysis done in [6] guarantees that Bulyan inherits it.

Intuitively, Definition 3 states that the gradient aggregation rule produces an output vector that lives, on average (over random samples used by SGD), in the cone of angle around the true gradient. We simply call this the ”correct cone”.

###### Definition 3 (–Byzantine resilience (as in [1])).

Let be any angular value, and any integer . Let be any independent identically distributed random vectors in , , with . Let be any random vectors in , possibly dependent on the ’s. An aggregation rule is said to be -Byzantine resilient if, for any , vector

satisfies
(i) ^{4}^{4}4Having a scalar product that is lower bounded by this value guarantees that the of multi-Krum lives in the aformentioned cone. For a visualisation of this requirement, see the ball and inner triangle of Figure 3 and
(ii) for , is bounded above by
a linear combination of terms
with .

#### Choice of .

The properties of the existing Byzantine-resilient SGD algorithms all depend on one important parameter, i.e., the number of potentially Byzantine nodes . It is important to notice that denotes a contract between the designer of the fault-tolerant solution and the user of the solution (who implements a service on top of the solution and deploys it in a specific setting). As long as the number of Byzantine workers is less than , the solution is safe. Fixing an optimal value for is an orthogonal problem. For example, if daily failures in a data center are about 1%, would be a suggested choice to tune the algorithm, and suffer from only a 99% slowdown.

The performance (convergence time) of certain existing Byzantine-resilient SGD algorithms in a non-Byzantine environment is independent of the choice of . These algorithms do not exploit the full potential of the choice of . Modern large-scale systems are versatile and often undergo important structural changes while providing online services (e.g., addition or maintenance of certain worker nodes). Intuitively, there should be a fine granularity between the level of pessimism (i.e., value of ) and the performance of the SGD algorithm in the setting with no Byzantine failures.

## 3 multi-Krum: Weak Byzantine Resilience and Slowdown

Let be any integer greater than , any integer s.t and an integer s.t . Let .

We first prove the –Byzantine resilience of multi-Krum (Lemma 1), then prove its almost sure convergence (Lemma 2) based on that, which proves the weak Byzantine resilience of multi-Krum (Theorem 1).

In all what follows, expectations are taken over random samples used by correct workers to estimate the gradient, i.e the ”S” (stochasticity) that is inherent to SGD. It is worth noting that this analysis in expectation is not an average case analysis from the point of view of Byzantine fault tolerance. For instance, the Byzantine worker is always assumed to follow arbitrarily bad policies and the analysis is a worst-case one.

The Byzantine resilience proof (Lemma 1) relies on the following observation: given , and in particular ^{5}^{5}5The slowdown question is an incentive to take the highest value of among those that satisfy Byzantine resilience, in this case ., -Krum averages gradients that are all in the ”correct cone”, and a cone is a convex set, thus stable by averaging. The resulting vectors therefore also live in that cone. The angle of the cone will depend on a variable as in [1], the value of itself depends on . This is what enables us to use multi-Krum as the basis of our multi-Krum, unlike [1] where a restriction is made on .

The proof of Lemma 2 is the same as the one in [1] which itself draws on the rather classic analysis of SGD made by L.Bottou [2]. The key concepts are (1) a global confinement of the sequence of parameter vectors and (2) a bound on the statistical moments of the random sequence of estimators built by the of multi-Krum. As in [1, 2], reasonable assumptions are made on the cost function , those assumption are not restrictive and are common in practical machine learning.

###### Theorem 1 (Byzantine resilience and slowdown of multi-Krum).

Let be any integer s.t. . (i) multi-Krum has weak Byzantine resilience against failures. (ii) In the absence of Byzantine workers, multi-Krum has a slowdown (expressed in ratio with averaging) of .

###### Proof.

Proof of (i). To prove (i), we will require Lemma 1 and Lemma 2, then conclude by construction of multi-Krum as a multi-Krum algorithm with .

###### Lemma 1.

Let be any independent and identically distributed random -dimensional vectors s.t , with and . Let be any random vectors, possibly dependent on the ’s. If and , where

then the function of multi-Krum is -Byzantine resilient where is defined by

###### Proof.

Without loss of generality, we assume that the Byzantine vectors occupy the last positions in the list of arguments of multi-Krum, i.e., . An index is correct if it refers to a vector among . An index is Byzantine if it refers to a vector among . For each index (correct or Byzantine) , we denote by (resp. ) the number of correct (resp. Byzantine) indices such that (the notation we introduced in Section 3 when defining multi-Krum), i.e the number of workers, among the neighbors of that are correct (resp. Byzantine). We have

We focus first on the condition (i) of -Byzantine resilience. We determine an upper bound on the squared distance . Note that, for any correct , . We denote by the index of the worst scoring among the vectors chosen by the multi-Krum function, i.e one that ranks with the smallest score in Equation 5 of the main paper (Section 3).

where denotes the indicator function^{6}^{6}6 equals if the predicate is true, and otherwise..
We examine the case for some correct index .

We now examine the case for some Byzantine index . The fact that minimizes the score implies that for all correct indices

Then, for all correct indices

We focus on the term . Each correct process has neighbors, and non-neighbors. Thus there exists a correct worker which is farther from than any of the neighbors of . In particular, for each Byzantine index such that , . Whence

Putting everything back together, we obtain

By assumption, , i.e., belongs to a ball centered at with radius . This implies

To sum up, condition (i) of the -Byzantine resilience property holds. We now focus on condition (ii).

Denoting by a generic constant, when , we have for all correct indices

The second inequality comes from the equivalence of norms in finite dimension. Now

Since the ’s are independent, we finally obtain that is bounded above by a linear combination of terms of the form with . This completes the proof of condition (ii). ∎

###### Lemma 2.

Assume that (i) the cost function is three times differentiable with continuous derivatives, and is non-negative, ; (ii) the learning rates satisfy and ; (iii) the gradient estimator satisfies and for some constants ; (iv) there exists a constant such that for all

(v) finally, beyond a certain horizon, , there exist and such that

Then the sequence of gradients converges almost surely to zero.

###### Proof.

For the sake of simplicity, we write . Before proving the main claim of the proposition, we first show that the sequence is almost surely globally confined within the region .

#### (Global confinement).

Let where

Note that

(5) |

This becomes an equality when . Applying this inequality to yields

Let denote the -algebra encoding all the information up to round . Taking the conditional expectation with respect to yields

Thanks to condition (ii) of -Byzantine resilience, and the assumption on the first four moments of , there exist positive constants such that

Thus, there exist positive constant such that

When , the first term of the right hand side is null because . When , this first term is negative because (see Figure 3)

Hence

We define two auxiliary sequences

Note that the sequence converges because . Then

Consider the indicator of the positive variations of the left-hand side

Then

The right-hand side of the previous inequality is the summand of a convergent series. By the quasi-martingale convergence theorem [9], this shows that the sequence converges almost surely, which in turn shows that the sequence converges almost surely, .

Let us assume that . When is large enough, this implies that and are greater than . Inequality 5 becomes an equality, which implies that the following infinite sum converges almost surely

Note that the sequence converges to a positive value. In the region , we have

This contradicts the fact that . Therefore, the sequence converges to zero. This convergence implies that the sequence is bounded, i.e., the vector is confined in a bounded region containing the origin. As a consequence, any continuous function of is also bounded, such as, e.g., , and all the derivatives of the cost function . In the sequel, positive constants etc…are introduced whenever such a bound is used.

#### (Convergence).

We proceed to show that the gradient converges almost surely to zero. We define

Using a first-order Taylor expansion and bounding the second derivative with , we obtain

Therefore

(6) |

By the properties of -Byzantine resiliency, this implies

which in turn implies that the positive variations of are also bounded

The right-hand side is the summand of a convergent infinite sum. By the quasi-martingale convergence theorem, the sequence converges almost surely, .

Taking the expectation of Inequality 6, and summing on , the convergence of implies that

We now define

Using a Taylor expansion, as demonstrated for the variations of , we obtain

Taking the conditional expectation, and bounding the second derivatives by ,

The positive expected variations of are bounded

The two terms on the right-hand side are the summands of convergent infinite series. By the quasi-martingale convergence theorem, this shows that converges almost surely.

We have

This implies that the following infinite series converge almost surely

Since converges almost surely, and the series diverges, we conclude that the sequence converges almost surely to zero. ∎

We conclude the proof of (i) by recalling the definition of multi-Krum, as the instance of with .

#### Proof of (ii).

(ii) is a consequence of the fact that -Krum is the average of estimators of the gradient. In the absence of Byzantine workers, all those estimators will not only be from the ”correct cone”, but from correct workers (Byzantine workers can also be in the correct cone, but in this case there are none). As SGD converges in , where is the number of used estimators of the gradient, the slowdown result follows. ∎

## 4 multi-Bulyan: Strong Byzantine Resilience and Slowdown

Let be any integer greater than , any integer s.t and an integer s.t . Let .

###### Theorem 2 (Byzantine resilience and slowdown of multi-Bulyan).

(i) multi-Bulyan provides strong Byzantine resilience against failures. (ii) In the absence of Byzantine workers, multi-Bulyan has a slowdown (expressed in ratio with averaging) of .

###### Proof.

If the number of iterations over multi-Krum is , then the leeway, defined by the coordinate-wise distance between the output of Bulyan and a correct gradient is upper bounded by . This is due to the fact that Bulyan relies on a component-wise median, that, as proven in [6] guarantees this bound. The proof is then a direct consequence of Theorem 1 and the properties of Bulyan [6]. ∎

## References

- [1] Blanchard, P., El Mhamdi, E. M., Guerraoui, R., and Stainer, J. Machine learning with adversaries: Byzantine tolerant gradient descent. In NIPS (2017), pp. 118–128.
- [2] Bottou, L. Online learning and stochastic approximations. Online learning in neural networks 17, 9 (1998), 142.
- [3] Damaskinos, G., El-Mhamdi, E.-M., Guerraoui, R., Guirguis, A., and Rouault, S. Aggregathor: : Byzantine machine learning via robust gradient aggregation. In Proceedings of the SysML Conference (2019).
- [4] Damaskinos, G., El Mhamdi, E. M., Guerraoui, R., Patra, R., Taziki, M., et al. Asynchronous Byzantine machine learning (the case of sgd). In Proceedings of the 35th International Conference on Machine Learning (Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018), vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 1153–1162.
- [5] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In NIPS (2012), pp. 1223–1231.
- [6] El Mhamdi, E. M., Guerraoui, R., and Rouault, S. The hidden vulnerability of distributed learning in Byzantium. In Proceedings of the 35th International Conference on Machine Learning (Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018), vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 3521–3530.
- [7] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
- [8] Haykin, S. S. Neural networks and learning machines, vol. 3. Pearson Upper Saddle River, NJ, USA:, 2009.
- [9] Métivier, M. Semi-Martingales. Walter de Gruyter, 1983.
- [10] Shalev-Shwartz, S., and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- [11] Xie, C., Koyejo, O., and Gupta, I. Generalized Byzantine-tolerant sgd. arXiv preprint arXiv:1802.10116 (2018).