# ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning

## Abstract

Distributed machine learning algorithms enable learning of models from datasets that are distributed over a network without gathering the data at a centralized location. While efficient distributed algorithms have been developed under the assumption of faultless networks, failures that can render these algorithms nonfunctional occur frequently in the real world. This paper focuses on the problem of Byzantine failures, which are the hardest to safeguard against in distributed algorithms. While Byzantine fault tolerance has a rich history, existing work does not translate into efficient and practical algorithms for high-dimensional distributed learning. In this paper, two variants of an algorithm termed Byzantine-resilient distributed coordinate descent (ByRDiE) are developed and analyzed that enable distributed learning in the presence of Byzantine failures. Theoretical analysis and numerical experiments presented in the paper highlight the usefulness of ByRDiE for high-dimensional distributed learning in the presence of Byzantine failures.

Index Terms: Byzantine failure, consensus, distributed optimization, empirical risk minimization, machine learning

## I Introduction

One of the fundamental goals in machine learning is to learn a model that minimizes the statistical risk. This is typically accomplished through stochastic optimization techniques, with the underlying principle referred to as empirical risk minimization (ERM) [2, 3, 4]. The ERM principle involves the use of a training dataset and tools from optimization theory. Traditionally, training data have been assumed available at a centralized location. Many recent applications of machine learning, however, involve the use of a dataset that is either distributed across different locations (e.g., the Internet of Things) or that cannot be processed at a single machine due to its size (e.g., social network data). The ERM framework in this setting of distributed training data is often referred to as decentralized or distributed learning [5, 6].

While there exist excellent works that solve the problem of distributed learning, all of them make the simplifying assumption that every node in the network functions as expected. Unfortunately, this assumption does not always hold in practice; examples include cyber attacks, malfunctioning equipment, and undetected failures [7, 8]. When a node arbitrarily deviates from its intended behavior, it is said to have undergone a Byzantine failure [9]. While Byzantine failures are hard to detect in general, they can easily jeopardize the operation of the whole network [10, 11, 12].

In particular, with just a simple strategy, one can show that a single Byzantine node in the network can lead to failures of most state-of-the-art distributed learning algorithms [13, 14, 15]. The main contribution of this paper is to introduce two variants of an algorithm that complete the distributed learning task in the presence of Byzantine failures in the network.

### I-A Related works

To achieve the goal of distributed learning, one usually sets up a distributed optimization problem by defining and minimizing a (regularized) loss function on the training data of each node. The resulting problem can then be solved by distributed optimization algorithms. Several types of distributed algorithms have been introduced in the past [16, 17, 18, 19, 20, 21, 22, 23, 24]. The most common of these are gradient-based methods [16, 17, 18], which usually have low local computational complexity. Another type comprises augmented Lagrangian-based methods [19, 20, 21], which iteratively update the primal and dual variables; these methods often require the ability to solve an optimization subproblem locally. A third type comprises second-order distributed methods [22, 23], which tend to have high computational and/or communication costs. While any of these algorithms can be used for distributed learning, they all assume that there are no failures in the network.

Byzantine-resilient algorithms for a variety of problems have been studied extensively over the years [9, 25, 26, 27, 28]. Byzantine-resilient algorithms for scalar-valued distributed average consensus were studied in [29]. The algorithms proposed in [13, 14] extend this work from scalar consensus to scalar-valued distributed optimization, but they cannot be used in vector settings. In the context of distributed learning, [15] introduces a method to implement a distributed support vector machine (SVM) under Byzantine failures, but it does not generalize to other learning problems. A recent work [30] solves a vector-valued distributed learning problem under Byzantine failures, but it requires a central processing center; because of this, it is not applicable in fully distributed settings.1

### I-B Our contributions

Works like [29, 13, 14, 15, 30, 37, 38] have several limitations. One limitation is that the proposed algorithms pursue the minimizer of a convex combination of local empirical risk functions, which is usually different from the minimizer of the exact average of the local loss functions. As such, there are no guarantees that the outputs of these algorithms achieve either the minimum empirical risk or the minimum statistical risk. Another limitation is that, when forced to work with vectors, existing Byzantine-resilient algorithms require a strong assumption on the network topology [39]: the minimum neighborhood size of each node in the vector setting grows linearly with the dimensionality of the problem. This is impractical for most learning problems, since the dimensionality of the training samples is usually much larger than the neighborhood size of each node.

In contrast to prior works, our work has two main contributions. First, we propose two variants of an algorithm that scale well with the dimensionality of distributed learning problems. Second, under the assumption of independent and identically distributed training samples, we provide theoretical guarantees that the outputs of the proposed algorithms will lead to the minimum statistical risk with high probability. Specifically, we show that the outputs of our algorithms converge to the minimizer of the statistical risk faster than using only local information by a factor that is defined in the sequel.

Notation: Given a vector and a matrix , and denote their -th and -th element, respectively. We use and to denote the - and -norms of , respectively, to denote the vector of all ones, and to denote the identity matrix. Given a set, denotes its cardinality, while denotes the transpose operation. We use to denote the vector formed by replacing the -th element of with . We use and to represent element-wise inequalities. Finally, we use to denote the gradient of a function with respect to .

Organization: The rest of this paper is organized as follows. Section II gives the problem formulation. Section III discusses the two variants of the proposed algorithm along with theoretical guarantees. Numerical results are provided in Section IV, while Section V concludes the paper.

## II Problem formulation

Given a network in which nodes have access to local training data, our goal is to learn a machine learning model from the distributed data, even in the presence of Byzantine failures. We begin with a model of our basic problem, followed by a model for Byzantine failures.

Consider a network of nodes, expressed as a directed, static graph . Here, the set represents nodes in the network, while the set of edges represents communication links between different nodes. Specifically, if and only if node can receive information from node . Each node has access only to a local training set . Let represent the training features satisfying for some constant and be the corresponding label. For classification, , while for regression.2 We assume that the training samples are independent and identically distributed (i.i.d.) and drawn from an unknown distribution , i.e., . For simplicity, we assume that the cardinalities of the local training sets are the same, i.e., . The generalization to the case when ’s are not equal is trivial.

In machine learning, one would ideally like to collect all the data into one set and perform centralized training on it. The goal in that case is to learn a function that reliably maps to . One popular mapping is (sometimes this is defined as , which can be transformed into by adding one more dimension to ). To find a “good” mapping, one first defines a loss function , where the value of the loss function increases as the difference between the mapping of and increases. To avoid overfitting, a regularizer is often added to the loss function. One can then solve for the mapping by statistically minimizing a regularized loss function . The regularized loss is often referred to as the risk function. In this paper, we focus on the class of convex differentiable loss functions and strictly convex and smooth regularizers3. Examples include the square loss , the squared hinge loss , and the logistic loss . We also assume the gradient of the loss function is -Lipschitz continuous. Since the loss is smooth, we formally state the Lipschitz assumption as follows.
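For concreteness, the loss functions named above can be written out in code. The following is an illustrative sketch in our own notation: the function names and the `lam` regularization weight are our choices, not the paper's, and an ℓ2 regularizer stands in for the generic strictly convex, smooth regularizer.

```python
import numpy as np

def square_loss(w, x, y):
    """Square loss: (y - <w, x>)^2."""
    return (y - w @ x) ** 2

def squared_hinge_loss(w, x, y):
    """Squared hinge loss: max(0, 1 - y<w, x>)^2, for labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * (w @ x)) ** 2

def logistic_loss(w, x, y):
    """Logistic loss: log(1 + exp(-y<w, x>)), for labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * (w @ x)))

def l2_regularizer(w, lam=0.1):
    """A strictly convex, smooth regularizer: (lam/2) ||w||_2^2."""
    return 0.5 * lam * (w @ w)

def regularized_empirical_risk(w, X, Y, loss, lam=0.1):
    """Regularizer plus average loss over the N training samples: the
    empirical counterpart of the regularized (statistical) risk."""
    return l2_regularizer(w, lam) + np.mean([loss(w, x, y) for x, y in zip(X, Y)])
```

All three losses are convex and differentiable in `w` (the squared hinge, unlike the plain hinge, has a Lipschitz-continuous gradient), which is what the assumptions below require.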

###### Assumption 1.

The risk function satisfies .

Assumption 1 implies the risk function itself is also Lipschitz, i.e., [40].

Using to denote the feasible set for , centralized machine learning focuses on learning the “best” mapping as . In this work, we take to be for a constant . Note that solving an unconstrained machine learning problem is equivalent to solving the constrained problem with a large enough . Extensions to other convex constraint sets can be carried out with slight modifications of our analysis, but will not be pursued in this work. Since is unknown, one cannot solve for directly. A broadly adopted alternative then is to minimize the empirical risk . In particular, the minimizer of the empirical risk can be shown to converge to with high probability [3]. We now make another assumption.

###### Assumption 2.

For any , the loss function is bounded almost surely over all training samples, i.e., , .

Note that Assumption 2 would be satisfied for datasets with finite-valued training samples because of the Lipschitz continuity of and the compactness of .

In many distributed learning problems, training data cannot be made available at a single location. This then requires learning in a distributed fashion, which can be done by employing distributed optimization techniques. The main idea of distributed optimization-based learning is to minimize the average of local empirical risks, i.e., . To achieve this goal, we need nodes to cooperate with each other by communicating over edges. Specifically, define the neighborhood of node as . We say that node is a neighbor of node if . Distributed learning algorithms proceed iteratively. In each iteration of the algorithm, node is expected to accomplish two tasks: (1) update a local variable according to some rule , and (2) broadcast the updated local variable to other nodes, where node can receive the broadcasted information from node only if .

### II-A Byzantine failure model

While distributed learning via message passing is well understood [21, 41], existing protocols require all nodes to operate as intended. In contrast, the main assumption in this paper is that some of the nodes in the network can undergo Byzantine failures, formally defined as follows.

###### Definition 1.

A node is said to be Byzantine if, during any iteration, it either updates its local variable using an update function other than the one intended or it broadcasts some value other than the intended update to its neighbors.

In this paper, we assume there are Byzantine nodes in the network. Knowing the exact value of , however, is not necessary. One can, for example, set to be an upper bound on the number of Byzantine nodes. Let denote the set of nonfaulty nodes. Without loss of generality, we assume the nonfaulty nodes are labeled from 1 to . We now provide some definitions and an assumption that are common in the literature; see, e.g., [13, 14].

###### Definition 2.

A subgraph of is called a reduced graph if it is generated by (i) removing all Byzantine nodes along with their incoming and outgoing edges, and (ii) additionally removing up to b incoming edges from each nonfaulty node.

###### Definition 3.

A “source component” of a graph is a collection of nodes such that each node in the source component has a directed path to every other node in the graph.

###### Assumption 3.

All reduced graphs generated from contain a source component of cardinality at least .

The purpose of Assumption 3 is to ensure there is enough redundancy in the graph to tolerate Byzantine failures. Note that the total number of distinct reduced graphs one can generate from is finite as long as the network is finite, so, in theory, Assumption 3 can be certified for any graph; efficient certification of this assumption, however, remains an open problem. In the case of the Erdős–Rényi graphs used in our experiments, we have observed that Assumption 3 is typically satisfied whenever the ratio of the average incoming degree of the graph to the number of Byzantine nodes is high enough.
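Definition 3 can be checked mechanically on any single (reduced) graph: the source component is exactly the set of nodes whose reachable set is the whole vertex set. A minimal sketch follows (helper names are our own; certifying Assumption 3 in full would additionally require enumerating every reduced graph, which is exponential, as noted above):

```python
def reachable(adj, start):
    """Set of nodes reachable from `start` along directed edges.
    `adj[u]` lists the out-neighbors of node u."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def source_component_size(adj):
    """Size of the source component of Definition 3: the number of nodes
    that have a directed path to every node in the graph."""
    n = len(adj)
    return sum(1 for u in range(n) if len(reachable(adj, u)) == n)
```

For example, a directed 3-cycle is its own source component of size 3, while in a directed chain only the first node can reach every other node.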

The goal of this paper is to develop a Byzantine fault-tolerant algorithm for distributed learning under Assumptions 1–3. In particular, under the assumption of at most Byzantine nodes in the network, we need to accomplish the following: (i) achieve consensus among nonfaulty nodes, i.e., as the number of iterations ; and (ii) ensure that as the sample size .

## III Byzantine-resilient distributed coordinate descent for decentralized learning

In distributed learning, one would ideally like to solve the empirical risk minimization (ERM) problem

$$w_{\mathrm{opt}} = \underset{w \in \mathcal{W}}{\operatorname{argmin}}\; R(w) + \frac{1}{MN}\sum_{j=1}^{M}\sum_{n=1}^{N} \ell\big(w,(x_{jn},y_{jn})\big) \tag{1}$$

at each node and show that as . However, we know from prior work [42] that this is infeasible when . Nonetheless, we establish in the following that distributed strategies based on coordinate descent algorithms [43] can still be used to solve a variant of (1) at nonfaulty nodes and guarantee that the solutions converge to the minimizer of the statistical risk for the case of i.i.d. training data. We refer to our general approach as Byzantine-Resilient Distributed coordinate dEscent (ByRDiE), which is based on key insights gleaned from two separate lines of prior work. First, it is known that certain types of scalar-valued distributed optimization problems can be inexactly solved in the presence of Byzantine failures [13, 14]. Second, coordinate descent algorithms break down vector-valued optimization problems into a sequence of scalar-valued problems [43]. The algorithmic aspects of ByRDiE leverage these insights, while the theoretical aspects leverage tools from statistical learning theory [3]. In the following, we discuss and analyze two specific variants of ByRDiE.

### III-A Algorithm I: ByRDiE with exact line search

The first variant of ByRDiE involves splitting the ERM problem (1) into one-dimensional subproblems using coordinate descent and then solving each scalar-valued subproblem using the Byzantine-resilient approach described in [13]. We term this variant “ByRDiE with exact line search” (ByRDiE-I), with the exact implementation detailed in Algorithm 1. This implementation can be broken into an outer loop (Step 3) and an inner loop (Step 5). The outer loop is the coordinate descent loop, which breaks the vector-valued optimization problem in each iteration into scalar-valued subproblems. The inner loop solves a scalar-valued optimization problem in each iteration and ensures resilience to Byzantine failures. As explained later in this section, the inner loop will return the minimizer of a convex combination (among the nonfaulty nodes) of local empirical risks with respect to each coordinate of . We will show that the output of the inner loop converges in probability to the minimizer of the statistical risk with respect to each coordinate as . Then we will analyze the effects of the inner loop on the coordinate descent loop.

#### Algorithmic Details

We assume the total number of iterations for coordinate descent is specified during initialization. The number of iterations for each scalar-valued subproblem within coordinate descent is assumed to be a function of that satisfies as . We use to denote the -th element of at the -th iteration of the coordinate descent loop and the -th iteration of the -th subproblem (coordinate). Without loss of generality, we initialize .

We now fix some and , and focus on the implementation of the inner loop (Step 5). Every node has some at the start of the inner loop (). During each iteration of this loop, all (nonfaulty) nodes engage in the following: broadcast, screening, and update. In the broadcast step (Step 7), all nodes broadcast ’s and each node receives . During this step, a node can receive values from both nonfaulty and Byzantine neighbors. The main idea of the screening step (Step 8) is to reject values at node that are either “too large” or “too small” so that the values being used for the update by node in each iteration will be upper and lower bounded by a set of values generated by nonfaulty nodes. To this end, we partition into three subsets , and , which are defined as follows:

$$\mathcal{N}^{s}_{j}(r,k,t) = \underset{X:\,X\subset\mathcal{N}_j,\;|X|=b}{\operatorname{argmin}} \;\sum_{i\in X}\big[w^{r}_{i}(t)\big]_{k}, \tag{2}$$

$$\mathcal{N}^{l}_{j}(r,k,t) = \underset{X:\,X\subset\mathcal{N}_j,\;|X|=b}{\operatorname{argmax}} \;\sum_{i\in X}\big[w^{r}_{i}(t)\big]_{k}, \tag{3}$$

$$\mathcal{N}^{*}_{j}(r,k,t) = \mathcal{N}_j \setminus \mathcal{N}^{s}_{j}(r,k,t) \setminus \mathcal{N}^{l}_{j}(r,k,t). \tag{4}$$

The step is called screening because node only uses ’s from to update its local variable. Note that there might still be ’s received from Byzantine nodes in . We will see later, however, that this does not affect the workings of the overall algorithm. The final step of the inner loop is the update step (Step 9). Using to denote the -th element of , we can write this update step as follows:

$$\big[w^{r}_{j}(t+1)\big]_{k} = \frac{1}{|\mathcal{N}_j|-2b+1}\sum_{i\in\mathcal{N}^{*}_{j}(r,k,t)\cup\{j\}} \big[w^{r}_{i}(t)\big]_{k} \;-\; \rho(t)\,\big[\nabla\hat{f}\big(w^{r}_{j}(t),S_j\big)\big]_{k}, \tag{5}$$

where are stepsizes that satisfy , and . An iteration of the coordinate descent loop is considered complete once all subproblems within the loop are solved. The local variable at each node at the end of this iteration is then denoted by:

$$\bar{w}_{j}(r) = \Big[\,\big[w^{r}_{j}(T(r))\big]_{1}\;\;\big[w^{r}_{j}(T(r))\big]_{2}\;\cdots\;\big[w^{r}_{j}(T(r))\big]_{P}\,\Big]^{\mathrm{T}}. \tag{6}$$

We also express the output of the whole algorithm as .
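Putting Steps 7–9 together, the screening and update of the inner loop at a nonfaulty node can be sketched for a single coordinate as follows. This is an illustrative sketch in our own notation: `grad_k` stands for the k-th coordinate of the gradient of the local empirical risk, and the function names are ours.

```python
def screen(received, b):
    """Screening (Eqs. (2)-(4)): discard the b smallest and the b largest of
    the coordinate values received from neighbors, keeping the middle set
    N*_j. Byzantine values may survive, but every surviving value is then
    sandwiched between values produced by nonfaulty nodes."""
    vals = sorted(received)
    return vals[b:len(vals) - b]

def inner_update(own_k, received_k, b, grad_k, rho_t):
    """One broadcast-screen-update step (Eq. (5)) for coordinate k at node j:
    average the node's own value with the screened neighbor values, then take
    a gradient step on the local empirical risk along coordinate k."""
    kept = screen(received_k, b)
    avg = (own_k + sum(kept)) / (len(kept) + 1)  # |N_j| - 2b + 1 terms
    return avg - rho_t * grad_k

# One admissible stepsize sequence: sum of rho(t) diverges, sum of rho(t)^2 converges.
rho = lambda t: 1.0 / (t + 1)
```

With `b = 1` and five received values, for instance, the two extreme values are dropped and the update averages the remaining three with the node's own value before the gradient step.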

#### Theoretical Guarantees

We now turn our attention to theoretical guarantees for ByRDiE-I. The first result in this regard simply follows from [13, Theorem 2] (also, see [44]).

###### Lemma 1.

Let Assumptions 1 and 3 hold, and let the -th subproblem of the coordinate descent loop in iteration be initialized with some . Then, as , for some such that .

Lemma 1 shows that each subproblem of the coordinate descent loop in ByRDiE-I asymptotically converges to the minimizer of some convex combination of local empirical risk functions of the nonfaulty nodes with respect to each coordinate. In addition, Lemma 1 guarantees that consensus is asymptotically achieved among the nonfaulty nodes at the end of each coordinate descent loop. In other words, when nonfaulty nodes begin a coordinate descent subproblem with identical local estimates and , they are guaranteed to begin the next subproblem with identical local estimates.

We now fix and use to denote the identical initial local estimates at nonfaulty nodes at the beginning of -th subproblem of the coordinate descent loop in the -th iteration. Next, we define and for as

$$h^{r}_{k}(\tilde{w}^{r}_{k},w') := f\big(\tilde{w}^{r}_{k}\big|_{[\tilde{w}^{r}_{k}]_{k}=w'},\,S\big), \quad\text{and} \tag{7}$$

$$H^{r}_{k}(\tilde{w}^{r}_{k},w') := \sum_{j\in J'}\alpha_{j}(r,k)\,\hat{f}\big(\tilde{w}^{r}_{k}\big|_{[\tilde{w}^{r}_{k}]_{k}=w'},\,S_{j}\big) \tag{8}$$

for some such that . Note that is strictly convex and Lipschitz. Now define

$$\hat{w}^{r}_{k} := \underset{w'\in\mathbb{R}}{\operatorname{argmin}}\; H^{r}_{k}(\tilde{w}^{r}_{k},w'), \quad\text{and} \tag{9}$$

$$\dot{w}^{r}_{k} := \underset{w'\in\mathbb{R}}{\operatorname{argmin}}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},w')\big]. \tag{10}$$

We then have the following result that relates to .

###### Lemma 2.

Fix any and let Assumption 2 be true. Then, with probability at least , we have

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\hat{w}^{r}_{k})\big] - \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\dot{w}^{r}_{k})\big] \;\le\; \sqrt{\frac{2C^{2}\ln\!\big(\tfrac{4}{\delta}\big)\sum_{j\in J'}\alpha^{2}_{j}(r,k)}{N}}. \tag{11}$$

The proof of this lemma is provided in Appendix A. In words, Lemma 2 states that the minimizer of a convex combination of the local empirical risk functions with respect to one coordinate at nonfaulty nodes (cf. (9)) achieves, with high probability, almost the same statistical risk with respect to one coordinate as that obtained using the corresponding statistical risk minimizer . This suggests that, even though ByRDiE-I cannot exactly solve the same subproblem as traditional coordinate descent (cf. [43] and Lemma 1) due to the presence of Byzantine nodes, the final solution of ByRDiE-I might still be able to match the performance of the statistical risk minimizer .

Before proceeding with a formal statement of such a result, it is instructive to examine the interplay between Lemma 1 and Lemma 2. Trivially, the factor appearing in Lemma 2 (due to Lemma 1) can be upper and lower bounded as . In particular, in the absence of Byzantine nodes in the network (i.e., ), we will have , resulting in with high probability. This is the fastest (coordinate-wise) rate of convergence for ByRDiE-I, which is of course unattainable for . In contrast, the slowest (coordinate-wise) rate of convergence for ByRDiE-I corresponds to the case of (i.e., ), resulting in with high probability. Lemma 2, together with Lemma 1, tells us that while we can no longer achieve the fastest convergence because of Byzantine failures, the coordinate-wise convergence rate still benefits from cooperation among the nonfaulty nodes by a factor .

Together, Lemma 1 and Lemma 2 help understand coordinate-wise convergence behavior of ByRDiE-I for a fixed . To derive a result that extends this idea simultaneously to all coordinates, we first need the following lemma.

###### Lemma 3.

Let the number of coordinates be fixed and let be any (arbitrarily large) positive integer. Further, let be any arbitrary collection of coefficients that satisfy and . Then, as long as Assumptions 1 and 2 hold, we have

$$\min_{w'\in\mathbb{R}} H^{r}_{k}(\tilde{w}^{r}_{k},w') \;\xrightarrow{\;N\to\infty\;}\; \min_{w'\in\mathbb{R}} \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},w')\big] \tag{12}$$

in probability simultaneously for coordinate descent iterations and coordinates .

The proof of this lemma, which is based on Lemma 2, is provided in Appendix B. We now make a couple of remarks in relation to Lemma 3. First, the proof of Lemma 3 can be leveraged to describe the rate at which approaches . Specifically, let us define . Then Lemma 3 can be interpreted to mean that, as long as

$$N = \Omega\Big(\bar{a}^{2}C^{2}\epsilon^{-2}P\log\big(144\,C L'\Gamma|J'|P\epsilon^{-2}\big)\Big) \tag{13}$$

with high probability. Second, Lemma 3 together with (9) and (10) imply in probability simultaneously for all pairs . We are now ready to state the main result for ByRDiE-I.

###### Theorem 1.

Let Assumptions 1, 2, and 3 hold for ByRDiE-I. Then all nonfaulty nodes in the network achieve consensus after each coordinate descent iteration , i.e., . Further, we also have in probability.

###### Proof.

Recall from the preceding discussion that and . It then trivially follows from Lemma 1 that all nonfaulty nodes in the network achieve consensus after each coordinate descent iteration to .

In order to prove the second part of this theorem, we first fix any arbitrary and then establish that converges to zero as a function of and . To this end, let us begin with any coordinate descent iteration and dimension . Then we recall from the earlier discussion that and, starting at Step 3, . Next, ,

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\dot{w}^{r}_{k})\big] \overset{(a)}{\le} \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},w')\big] \overset{(b)}{\le} \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\hat{w}^{r-1}_{k})\big] + \frac{L}{2}\big|w'-\hat{w}^{r-1}_{k}\big|^{2} + \Big[\nabla\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\hat{w}^{r-1}_{k})\big]\Big]_{k}\big(w'-\hat{w}^{r-1}_{k}\big), \tag{14}$$

where (a) follows from the definition of and (b) follows from Assumption 1. Plugging the minimizing choice of $w'$ into (14), we obtain

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\hat{w}^{r-1}_{k})\big] - \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\dot{w}^{r}_{k})\big] \;\ge\; \frac{1}{2L}\Big[\nabla\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\hat{w}^{r-1}_{k})\big]\Big]^{2}_{k}. \tag{15}$$
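The step from (14) to (15) amounts to minimizing the quadratic upper bound in (14) over $w'$; a sketch of the computation, with the shorthand $g_k$ being our own notation:

```latex
% Write g_k := [\nabla \mathbb{E}_{(x,y)\sim\mathcal{D}}[h^r_k(\tilde{w}^r_k, \hat{w}^{r-1}_k)]]_k.
% The right-hand side of (14) is a quadratic in w', minimized at
%   w' = \hat{w}^{r-1}_k - \tfrac{1}{L}\, g_k.
% Substituting this choice of w' into (14) gives
\mathbb{E}\big[h^r_k(\tilde{w}^r_k,\dot{w}^r_k)\big]
  \;\le\; \mathbb{E}\big[h^r_k(\tilde{w}^r_k,\hat{w}^{r-1}_k)\big]
          + \frac{L}{2}\cdot\frac{g_k^2}{L^2} - \frac{g_k^2}{L}
  \;=\; \mathbb{E}\big[h^r_k(\tilde{w}^r_k,\hat{w}^{r-1}_k)\big] - \frac{g_k^2}{2L},
% and rearranging yields (15).
```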

Next, fix an arbitrary and note from Lemma 3 that we have ,

$$\Big|\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\hat{w}^{r}_{k})\big] - \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\dot{w}^{r}_{k})\big]\Big| \;\le\; \epsilon' \tag{16}$$

with probability that converges to one as a function of and . We now condition on the event described by (16) and assume that . It then follows from (15) that

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\hat{w}^{r-1}_{k})\big] - \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[h^{r}_{k}(\tilde{w}^{r}_{k},\hat{w}^{r}_{k})\big] \;\ge\; \epsilon'. \tag{17}$$

We see from (17) that is a strictly monotonic function of . In particular, each (outer) iteration of ByRDiE-I reduces the coordinate-wise statistical risk by at least until the gradient with respect to coordinate falls below a threshold: . Once that happens for all , we obtain for large enough values of . Afterward, convexity of implies

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}f\big(w^{*},(x,y)\big) \;\ge\; \mathbb{E}_{(x,y)\sim\mathcal{D}}f\big(\tilde{w}^{r}_{k},(x,y)\big) + \nabla\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[f\big(\tilde{w}^{r}_{k},(x,y)\big)\big]^{\mathrm{T}}\big(w^{*}-\tilde{w}^{r}_{k}\big). \tag{18}$$

Combining these facts and using the Cauchy–Schwarz inequality, we obtain

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}f\big(\tilde{w}^{r}_{k},(x,y)\big) - \mathbb{E}_{(x,y)\sim\mathcal{D}}f\big(w^{*},(x,y)\big) \;\le\; \big\|\nabla\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[f\big(\tilde{w}^{r}_{k},(x,y)\big)\big]\big\|\,\big\|\tilde{w}^{r}_{k}-w^{*}\big\| \;<\; 2PL\Gamma\epsilon'. \tag{19}$$

Finally, setting in (19) and removing the conditioning on the event (16) give us the desired bound. We conclude by noting that the same result trivially follows when . ∎

Recall that both the empirical and the statistical risks are strictly convex. Therefore Theorem 1 can also be interpreted as in probability due to the uniqueness of the minimum of strictly convex functions.

### Iii-B Algorithm II: ByRDiE with switching coordinate

ByRDiE-I is the first algorithm in the literature that provably converges to the statistical risk minimizer in fully distributed settings in the presence of Byzantine failures. Nonetheless, it is communication-heavy because of its two time-scale nature: each ByRDiE-I iteration (indexed by ) requires several rounds of communications per coordinate (indexed by ). This is because ByRDiE-I leverages coordinate descent with exact line search in each dimension. However, coordinate descent can also be implemented by taking only one step in the direction of descent in a dimension and then switching to another dimension [43]. In this section, we propose and analyze another variant of ByRDiE that is based on this implementation of coordinate descent. We term this variant “ByRDiE with switching coordinate” (ByRDiE-II), with the exact implementation detailed in Algorithm 2.

#### Algorithmic Details

It can be seen from Algorithm 2 that ByRDiE-II is a single time-scale algorithm, as it requires only one round of communications per coordinate in each iteration (indexed by ). In every iteration , ByRDiE-II sequentially updates each coordinate (indexed by ) from to .4 In the following, we use to denote the optimization variable, , at node after the -th coordinate has been updated within the -th iteration of the algorithm, while denotes at the end of the -th iteration, i.e., . Comparing Algorithm 2 with Algorithm 1, we observe that—apart from the single time-scale nature of ByRDiE-II—both the algorithms operate in a similar fashion. Specifically, without loss of generality, (nonfaulty) nodes initialize with and engage in broadcast (Step 6), screen (Step 7), and update (Step 8) in each iteration of ByRDiE-II. In particular, the screening procedure in ByRDiE-II also involves partitioning into 3 subsets , and as follows:

$$\mathcal{N}^{s}_{j}(r,k) = \underset{X:\,X\subset\mathcal{N}_j,\;|X|=b}{\operatorname{argmin}} \;\sum_{i\in X}\big[w^{k}_{i}(r)\big]_{k}, \tag{20}$$

$$\mathcal{N}^{l}_{j}(r,k) = \underset{X:\,X\subset\mathcal{N}_j,\;|X|=b}{\operatorname{argmax}} \;\sum_{i\in X}\big[w^{k}_{i}(r)\big]_{k}, \tag{21}$$

$$\mathcal{N}^{*}_{j}(r,k) = \mathcal{N}_j \setminus \mathcal{N}^{s}_{j}(r,k) \setminus \mathcal{N}^{l}_{j}(r,k). \tag{22}$$

Finally, each node updates its local variable as follows:

$$\big[w^{k}_{j}(r+1)\big]_{k} = \frac{1}{|\mathcal{N}_j|-2b+1}\sum_{i\in\mathcal{N}^{*}_{j}(r,k)\cup\{j\}} \big[w^{k}_{i}(r)\big]_{k} \;-\; \rho(r)\,\big[\nabla\hat{f}\big(w^{k}_{j}(r),S_j\big)\big]_{k}. \tag{23}$$
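The single time-scale sweep can be sketched as follows. This is a sketch under assumed interfaces, not the paper's implementation: `neighbor_vals(k)` is our stand-in for the broadcast step (the k-th coordinate values received from neighbors for this round) and `local_grad(w)` for the gradient of the local empirical risk.

```python
import numpy as np

def byrdie_ii_iteration(w_j, neighbor_vals, b, local_grad, rho_r):
    """One iteration r of ByRDiE-II at a nonfaulty node j (Eqs. (20)-(23)):
    a single sweep over the coordinates, with exactly one round of
    broadcast/screen/update per coordinate."""
    w = w_j.copy()
    for k in range(len(w)):
        vals = sorted(neighbor_vals(k))
        kept = vals[b:len(vals) - b]                  # screening, Eqs. (20)-(22)
        avg = (w[k] + sum(kept)) / (len(kept) + 1)
        w[k] = avg - rho_r * local_grad(w)[k]         # update, Eq. (23)
    return w
```

Note that the gradient is evaluated at the partially updated iterate, since coordinate k is updated after coordinates 1 through k-1 within the same sweep; this is what distinguishes the single time-scale variant from running a full inner loop per coordinate as in ByRDiE-I.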

#### Theoretical Guarantees

While ByRDiE-I and ByRDiE-II share algorithmic similarities, the analytical tools required to guarantee that ByRDiE-II, despite its single time-scale nature, results in the statistical risk minimizer at nonfaulty nodes are quite different. In particular, theoretical guarantees for ByRDiE-II require establishing the following two claims: (i) the local optimization variables at the nonfaulty nodes reach consensus on the same vector , which will be defined later, as ; and (ii) in probability.

We begin our discussion with the first claim by leveraging the insights developed in [44]. Since ByRDiE-II is based on coordinate descent, it suffices to focus on any arbitrary dimension and show that achieve consensus as . We now define a vector such that (with a slight abuse of notation)