Robust Federated Learning Using ADMM in the Presence of Data Falsifying Byzantines

Qunwei Li    Bhavya Kailkhura    Ryan Goldhahn    Priyadip Ray    Pramod K. Varshney

Syracuse University                        Lawrence Livermore National Laboratory

Abstract

In this paper, we consider the problem of federated (or decentralized) learning using ADMM with multiple agents. We consider a scenario where a certain fraction of agents (referred to as Byzantines) provide falsified data to the system. In this context, we study the convergence behavior of the decentralized ADMM algorithm. We show that ADMM converges linearly to a neighborhood of the solution to the problem under certain conditions. We next provide guidelines for network structure design to achieve faster convergence. Next, we provide necessary conditions on the falsified updates for exact convergence to the true solution. To tackle the data falsification problem, we propose a robust variant of ADMM. We also provide simulation results to validate the analysis and show the resilience of the proposed algorithm to Byzantines.

 


1 Introduction

Many machine learning and statistics problems fit into the general framework where a finite-sum structure of functions is to be optimized. In general, the problem is formulated as

(1)
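For concreteness, a minimal sketch of the usual finite-sum form with n agents and local costs f_i is given below; this is an assumption about the standard convention, since the exact symbols of (1) are not reproduced in this text.

    \min_{\tilde{x} \in \mathbb{R}^p} \; \sum_{i=1}^{n} f_i(\tilde{x}),

where each f_i is the cost built from the data held locally by agent i.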

The problem structure in (1) covers collaborative autonomous inference in statistics and linear/logistic regression, support vector machines, and deep neural networks in machine learning. Due to the emergence of the big data era and associated sizes of datasets, solving problem (1) on a single node (or agent) is often impossible, as storing the entire dataset on a single node becomes infeasible. This gives rise to the federated optimization setting [9], in which the training data for the problem is stored in a distributed fashion across a number of interconnected nodes and the optimization problem is solved collectively by the cluster of nodes. However, distributing computation over several nodes induces a higher risk of failures, including communication noise, crashes and computation errors. Furthermore, some nodes, often referred to as Byzantine nodes, may intentionally inject false data to gain unfair advantage or degrade the system performance. While Byzantines (originally proposed in [10]) may, in general, refer to many types of unwanted behavior, our focus in this paper is on data-falsification. Data falsifying Byzantines can easily prevent the convergence of the federated learning algorithm [8, 9].

There exist several decentralized optimization methods for solving (1), including belief propagation [15], distributed subgradient descent algorithms [13], dual averaging methods [4], and the alternating direction method of multipliers (ADMM) [2]. Among these, ADMM has drawn significant attention, as it is well suited for distributed optimization and demonstrates fast convergence in many applications [19, 17]. More specifically, ADMM was found to converge linearly for a large class of problems [7]. In [16], a linear convergence rate was also established for decentralized ADMM. Recently, the performance analysis of ADMM in the presence of inexactness in the updates has received some attention [1, 20, 18, 14, 6, 5]. Most relevant to our work, [3] studies an inexact ADMM algorithm for the decentralized consensus problem. The authors in [11] studied the scenario where an error occurs in the ADMM update; they considered the occurrence of the error in the -update step in (7), but failed to consider it in the -update step. Moreover, most of the aforementioned papers assume that the inexactness occurs in an intermediate proximal-mapping step within one ADMM iteration, which is a limited setting and different from the one studied in this paper. The focus of our work is on Byzantine falsification, where Byzantines have a large degree of freedom and can falsify any algorithm parameters without abiding by an error model. Thus, we consider a general falsification model in which the inexactness occurs in the update after one ADMM iteration; our model can therefore be seen as an extension of the aforementioned works.

Our main contributions can be summarized as follows.

  • First, we analyze the convergence behavior of ADMM for the decentralized consensus optimization problem with data-falsification errors in updates. A general performance guarantee is established with respect to the distance between the update and the solution of the problem.

  • Second, we show that ADMM converges linearly to a neighborhood of the solution if certain conditions involving the network topology, the properties of the objective function, and the algorithm parameter, are satisfied. Guidelines are developed for network structure design to achieve faster convergence.

  • Third, we give several conditions on the data-falsification errors to obtain exact convergence to the solution.

  • Finally, to tackle the data-falsification errors, a robust variant of ADMM is proposed. This scheme relies upon a node identity profiling approach to locate the malicious Byzantine nodes.

1.1 Notations

Non-bold characters will be used for scalars, e.g., the algorithm parameter ; bold lower-case characters will be used for column vectors, e.g., the update ; bold, upper-case characters will be used for matrices, e.g., the extended signless Laplacian matrix ; and calligraphic upper-case characters will be used for sets, e.g., the set of arcs .

For a positive semidefinite matrix , we define as the smallest nonzero eigenvalue of and as its largest nonzero eigenvalue.

In particular, denotes the set of real numbers; denotes the -dimensional Euclidean space; and denotes the set of all real matrices.

2 Problem Formulation

2.1 Federated Learning with ADMM

Consider a network consisting of agents bidirectionally connected by edges (and thus arcs). We can describe the network as a symmetric directed graph or an undirected graph , where is the set of vertices with cardinality , is the set of arcs with , and is the set of edges with . In a federated setup, a local agent generates updates individually (by solving a local optimization problem) and communicates with its neighbors to reach a network-wide common minimizer.

More specifically, the federated learning problem can be formulated as follows:

(2)

Here is the local copy of the common optimization variable at agent , and is an auxiliary variable imposing the consensus constraint on neighboring agents and . In the constraints, are separable when are fixed, and vice versa. Clearly, (2) is equivalent to (1) when the network is connected.
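As a reference point for the elided symbols, the standard decentralized consensus reformulation (e.g., as used in [16]) reads as follows; this is a sketch of that convention rather than a verbatim copy of (2):

    \min_{\{x_i\},\{z_{ij}\}} \; \sum_{i=1}^{n} f_i(x_i)
    \quad \text{s.t.} \quad x_i = z_{ij}, \;\; x_j = z_{ij}, \quad \forall (i,j) \in \mathcal{A}.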

Defining as a vector concatenating all , as a vector concatenating all , and , (2) can be written in matrix form as

(3)

where , which fits the standard ADMM form and is amenable to solution by ADMM. Here ; are both composed of blocks of matrices. If and is the th block of , then the th block of and the th block of are identity matrices ; otherwise the corresponding blocks are zero matrices . Also, we have with being a identity matrix.

We define the following matrices: and . Let be a block diagonal matrix with its th block being the degree of agent multiplied by and other blocks being , , , and we know . These matrices are related to the underlying network topology. With regard to the undirected graph , and are the extended unoriented and oriented incidence matrices, respectively; and are the extended signless and signed Laplacian matrices, respectively; and is the extended degree matrix. By “extended”, we mean replacing every by , by , and by in the original definitions of these matrices.
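Assuming these definitions follow the conventions of [16] (a hedged assumption, since the symbols above are not reproduced here), the extended matrices are obtained from their scalar graph counterparts by a Kronecker product with the p-by-p identity and are related as

    M_+ = \hat{M}_+ \otimes I_p, \qquad M_- = \hat{M}_- \otimes I_p,
    \qquad L_+ = \tfrac{1}{2} M_+ M_+^{\top}, \qquad L_- = \tfrac{1}{2} M_- M_-^{\top}, \qquad 2D = L_+ + L_-,

where \hat{M}_+ and \hat{M}_- denote the unoriented and oriented incidence matrices of the undirected graph and D denotes the extended degree matrix.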

2.2 Decentralized ADMM with Byzantines

The iterative updates of the ADMM algorithm are given by [16]

(4)

The updates in (4) are carried out in a distributed fashion at the agents. Note that , where is the local solution of agent , and , where is the local Lagrange multiplier of agent . Recalling the definitions of , and , (4) yields the update of agent as

(5)

where denotes the set of neighbors of agent . The algorithm is fully decentralized since the updates of and only rely on local and neighboring information.
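To make the per-agent computation concrete, the Python sketch below instantiates these updates for quadratic local costs f_i(x) = 0.5*||A_i x - b_i||^2. The quadratic choice, the variable names, and the exact update form (taken from the decentralized consensus ADMM of [16]) are illustrative assumptions rather than a verbatim restatement of (5).

import numpy as np

def local_x_update(A_i, b_i, x_i, alpha_i, neighbor_xs, c):
    # One primal update at agent i for f_i(x) = 0.5 * ||A_i x - b_i||^2,
    # following the decentralized consensus ADMM of [16] (a sketch under
    # that assumption):
    #   (A_i^T A_i + 2 c |N_i| I) x_i^{k+1}
    #       = A_i^T b_i - alpha_i^k + c (|N_i| x_i^k + sum_{j in N_i} x_j^k)
    d_i = len(neighbor_xs)                              # degree |N_i|
    rhs = A_i.T @ b_i - alpha_i + c * (d_i * x_i + sum(neighbor_xs))
    H = A_i.T @ A_i + 2.0 * c * d_i * np.eye(A_i.shape[1])
    return np.linalg.solve(H, rhs)

def local_alpha_update(alpha_i, x_i_new, neighbor_xs_new, c):
    # Local dual (Lagrange multiplier) update at agent i.
    d_i = len(neighbor_xs_new)
    return alpha_i + c * (d_i * x_i_new - sum(neighbor_xs_new))

Both functions use only the agent's own data and its neighbors' latest updates, reflecting the fully decentralized nature of the scheme.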

Consider the case where a fraction of the nodes generate erroneous updates; these nodes are termed Byzantines. Assume that the true update undergoes data-falsifying manipulation at the Byzantines, with the outcome modeled as , which is denoted as . The corresponding algorithm for the data-falsification Byzantine case is

(6)

For a clearer presentation, we will use the following form of iteration for our analysis

(7)

Compared with the update steps in (4), is replaced by the erroneous update in the first step, and is replaced by in the second step.
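Under this model, a Byzantine node shares a falsified version of its update with its neighbors. A minimal sketch, assuming the additive Gaussian attack that is later simulated in Section 5 (the attack distribution and its parameters are placeholders, not the paper's exact settings):

import numpy as np

def falsify_update(x_true, rng, mean=0.0, std=1.0):
    # Data falsification at a Byzantine node: the true local update is
    # perturbed before being shared (additive Gaussian attack; mean and
    # std are placeholder values).
    return x_true + rng.normal(mean, std, size=x_true.shape)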

2.3 Assumptions

We give two definitions that will be used for the cost function.

Definition 1 (L-smoothness).

A function is L-smooth if there is a constant such that

(8)

Note that such an assumption is very common in the analysis of first-order optimization methods. From the definition, we can see that a function being -smooth also means that the gradient is -Lipschitz continuous.
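For reference, a standard way to write the L-smoothness condition and its quadratic upper-bound consequence is sketched below; (8) is not reproduced here, so this is an assumption about the usual convention.

    \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|
    \quad \Longrightarrow \quad
    f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{L}{2}\|y - x\|^2 .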

Definition 2 (-strongly convex).

A function is -strongly convex if there is a constant , such that

(9)

The constant measures how convex a function is. In particular, the larger the value of , the more convex is.
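Similarly, a standard statement of μ-strong convexity (a sketch; (9) may use an equivalent form) is

    f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{\mu}{2}\|y - x\|^2 \quad \text{for all } x, y,

or, equivalently, f(x) - \tfrac{\mu}{2}\|x\|^2 is convex.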

We adopt the following assumptions for the problem throughout our analysis.

Assumption 1.
  1. Function is proper and .

  2. Function is continuously differentiable and -smooth.

  3. Function is -strongly convex.

3 Convergence Analysis

To effectively present the convergence results (proofs of the theoretical analysis are provided in the supplementary material), we first give a few notations and definitions. Let . Specifically, let , where is the singular value decomposition of the positive semidefinite matrix . We also construct a new auxiliary sequence . This sequence is not physically generated, but is used only for our analysis. Let , where denotes an optimal solution to the problem.

Define the auxiliary vector and matrix as

Theorem 1.

There exists such that

(10)

with

(11)

where

(12)

and

(13)

with quantities , , and being greater than 1.

Theorem 1 shows that the sequence converges linearly with a rate of if, after a certain number of iterations, there are no data-falsification errors in the updates. In that case, it can easily be shown that the sequence or converges to the minimizer, since the last two terms in the bound are removed. However, if the errors persist in the updates, the theorem shows how the errors accumulate after each iteration. As a general result, one can further optimize over , , and to obtain the largest and the smallest , achieving the fastest convergence and the least impact from the errors.
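To make the accumulation mechanism concrete, consider a generic contraction of the kind appearing in Theorem 1; this is an illustrative abstraction, not the theorem's exact inequality. If a_k denotes the (squared) distance to the solution at iteration k and e_k the error contribution at that iteration, then

    a_{k+1} \le \delta\, a_k + e_k, \quad 0 < \delta < 1
    \quad \Longrightarrow \quad
    a_K \le \delta^{K} a_0 + \sum_{k=0}^{K-1} \delta^{K-1-k} e_k ,

so the iterates contract geometrically while past errors are discounted by powers of δ.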

Theorem 2.

Let be chosen such that

(14)

where and , then

(15)

where

(16)

with

(17)

and

(18)
(19)

Theorem 2 presents a general convergence result of ADMM for decentralized consensus optimization with errors, and indicates that the erroneous update approaches a neighborhood of the minimizer in a linear fashion. The radius of the neighborhood is given by . Note that is not guaranteed to be less than . This is very different from the convergence result of ADMM for decentralized consensus optimization without errors [16], which guarantees that the update converges to the minimizer linearly with a rate less than . If , and ends up being greater than , then the algorithm will not converge at all.

In particular, we define as the radius of the neighborhood in this paper. Thus, we divide the problem into two different parts. The first one is to guarantee that is within the range , and the second one is to minimize the radius of the neighborhood .

Accordingly, we optimize over the variables that appeared in the above theorems and the algorithm parameter , and give the convergence result with .

Theorem 3.

If and can be chosen, such that with

(20)

then the ADMM algorithm with a parameter converges linearly with a rate of , to the neighborhood of the minimizer with a radius of , which is

(21)

where

(22)
(23)

and

(24)

Theorem 3 provides an optimal set of choices of variables and the algorithm parameter such that and the radius is minimized. In a nutshell, and maximizes , guarantees , and minimizes the radius .

Since and recalling the condition in the theorem, we also need

(25)

One intuition is that we should design the network such that is as small as possible. Substituting from the theorem into the expression (25), we have

(26)
Remark 1.

The value of , which corresponds to the network structure, has to be greater than a certain threshold such that a linear convergence rate of can be achieved. This shows that a decentralized network with a random structure may not converge at all to the neighborhood of the minimizer, when the ADMM algorithm is implemented with erroneous updates.

Remark 2.

It is easy to show that the right hand side of the inequality is strictly less than 1. Considering as the only variable in the expression on the right hand side, it is upper bounded by . Thus, if we can design a network such that its corresponding value of is greater than this bound, we can ensure that the decentralized ADMM algorithm can converge to the neighborhood of the minimizer.

Remark 3.

The right-hand side of the above expression depends on the geometric properties of the cost function. For certain classes of cost functions, the value of the right-hand side is lower than for others. This allows for a more flexible network structure design while still achieving a linear convergence rate.

Corollary 1.

If decreases linearly at a rate such that , and the constraints in Theorem 3 are satisfied, the algorithm converges to the minimizer linearly with a rate of .

This result simply states that if the error in the update decays faster than the distance between the update and the minimizer , then the algorithm will reach the minimizer at a linear rate.

Corollary 2.

If , and the constraints in Theorem 3 are satisfied, then an upper bound on the error for is .

This result shows that if the error at every iteration is bounded, then the algorithm will approach the bounded neighborhood of the minimizer.
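In the same illustrative abstraction used after Theorem 1, a uniform error bound e_k ≤ E gives

    a_K \le \delta^{K} a_0 + E \sum_{k=0}^{K-1} \delta^{k} \le \delta^{K} a_0 + \frac{E}{1-\delta},

so the iterates settle into a neighborhood whose radius scales with E, matching the intuition of Corollary 2.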

Theorem 4.

If for each iteration,

(27)

with

(28)

and the constraints in Theorem 3 are satisfied, the algorithm converges to the minimizer linearly with a rate of .

Recalling the result in Corollary 1, Theorem 4 essentially states that when the error decays faster than the distance between the update and the minimizer , the algorithm can approach the minimizer, which is intuitive. However, Theorem 4 gives a much more general condition for convergence to the minimizer. Note that , , , and are fixed, and condition (27) relates the current error to all the previously accumulated errors. The error impacts the performance of decentralized ADMM in two ways. First, the error at an individual local agent causes the local update to deviate from the true update. Second, a local error can propagate over the network through the exchange of updates between neighboring nodes, thus degrading the precision of subsequent updates at other nodes. Hence, errors that occurred before the current iteration can diffuse and accumulate over the network. Theorem 4 gives an upper bound on the current error in terms of the past errors, such that the network can tolerate the accumulated errors and convergence to the minimizer can still be guaranteed.

4 Robust ADMM

Based upon the insights provided by our theoretical results in Section 3, we investigate the design of a robust ADMM algorithm that can tolerate errors in the ADMM updates. We focus on the scenario where a fraction of the nodes generate erroneous updates. The remaining nodes, which generate true updates (or whose errors can be neglected), are called honest nodes in this paper.

4.1 Byzantine Profiling

In this section, we describe a scheme for Byzantine profiling. Note that for a decentralized consensus optimization problem, the honest nodes iteratively approach a consensus. Thus, after a certain number of iterations, the Byzantine nodes behave quite differently from the others. In particular, if the Byzantine nodes were removed from the neighborhood, the variance of the update values would be significantly reduced.

Specifically, after iterations, agent starts to check the update quality of its neighbors. Say that at the th iteration, agent constructs a new vector using the updates from its neighbors , where . To compare against , the elements corresponding to the updates from agent are substituted with the mean of the updates from agent and its neighbors, denoted by , i.e.,

(29)

Note that for the decentralized ADMM algorithm, the constraint guarantees consensus at the minimizer, and the norm has been shown to be a good metric of the current update's deviation from consensus [12]. Then, if exceeds a predefined threshold , i.e., , an alarm is triggered and the profiling process starts.

First, for every agent , a unique vector is assigned, which is constructed in a similar way as , except that the elements of and are replaced with , the mean of the rest of the updates in , i.e.,

(30)

Then, a score is calculated for agent . Node is profiled by node as a Byzantine node with the following criterion

(31)
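A Python sketch of this profiling step is given below. The consensus-deviation metric is passed in as a callable, since the exact norm used in (29)-(31) is not reproduced in this text, and the flagging rule shown is likewise a placeholder for criterion (31).

import numpy as np

def profile_byzantines(x_i, neighbor_xs, deviation, epsilon):
    # Byzantine profiling at agent i (a sketch of Section 4.1).
    # neighbor_xs: dict mapping neighbor id j -> latest shared update x_j.
    # deviation:   callable measuring deviation from consensus for a list
    #              of updates (stands in for the norm used in (29)).
    updates = list(neighbor_xs.values())
    baseline = deviation([x_i] + updates)        # deviation with all neighbors included
    if baseline <= epsilon:                      # no alarm: consensus looks fine
        return set()
    flagged = set()
    for j in neighbor_xs:
        # Replace neighbor j's update by the mean of the remaining updates
        # and re-evaluate the deviation, as in (30).
        others = [neighbor_xs[k] for k in neighbor_xs if k != j]
        rest_mean = np.mean([x_i] + others, axis=0)
        substituted = [rest_mean if k == j else neighbor_xs[k] for k in neighbor_xs]
        score = baseline - deviation([x_i] + substituted)   # score for neighbor j
        if score > 0:                            # placeholder for criterion (31)
            flagged.add(j)
    return flagged

For instance, `deviation` could be taken as the empirical variance of the stacked updates; a neighbor whose removal sharply reduces this deviation is the natural suspect.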

4.2 Robust Algorithm

According to Theorem 1, if the Byzantine nodes are correctly profiled and their updates are no longer used in the iterations, the ADMM algorithm can still converge to the true minimizer. Thus, if agent has profiled its neighbor as a Byzantine, the update from will no longer be used in the iterations in .

Even though it may appear that every node has to maintain knowledge of the whole network structure , it can be shown after simple manipulations that , which does not require knowledge of the network structure and ensures the applicability of the scheme to networks with high mobility.

We give the corresponding pseudocode for ease of presentation.

1:function 
2:     Initialization: , , ,
3:     for  to  do
4:         Iteratively update using (7)
5:         if  then
6:              Start Byzantine Profiling
7:              Discard neighbor from future updates
8:         end if
9:     end for
10:end function
Algorithm 1 Robust Decentralized ADMM

5 Experiments

In this section, we use ADMM to solve the decentralized consensus optimization problem

A decentralized network with nodes is employed to perform the optimization task. We assume that there are Byzantine nodes in the network and the update is falsified by Gaussian random variables with mean and variance . We initialize and randomly with elements generated by standard Gaussian distribution .

The algorithm stops when the number of iterations reaches , and we record the average error (where the average is taken over all the agents).
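A minimal simulation harness in the spirit of this setup is sketched below, reusing the local_x_update, local_alpha_update, and falsify_update sketches from Section 2. The network size, topology, least-squares data, and attack parameters are placeholder assumptions, since the exact experimental values are not reproduced in this text.

import numpy as np

rng = np.random.default_rng(0)
n, p, c, T = 20, 5, 1.0, 200                     # placeholder sizes and ADMM parameter
byzantine = set(rng.choice(n, size=2, replace=False).tolist())   # placeholder Byzantine set

# Placeholder decentralized least-squares data: f_i(x) = 0.5 * ||A_i x - b_i||^2.
A = [rng.standard_normal((10, p)) for _ in range(n)]
x_ref = rng.standard_normal(p)
b = [A[i] @ x_ref + 0.1 * rng.standard_normal(10) for i in range(n)]

neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}    # placeholder ring topology
x = [rng.standard_normal(p) for _ in range(n)]
alpha = [np.zeros(p) for _ in range(n)]

for k in range(T):
    # Byzantine nodes share falsified updates; honest nodes share true ones.
    shared = [falsify_update(x[i], rng) if i in byzantine else x[i] for i in range(n)]
    x_new = [local_x_update(A[i], b[i], x[i], alpha[i],
                            [shared[j] for j in neighbors[i]], c) for i in range(n)]
    shared_new = [falsify_update(x_new[i], rng) if i in byzantine else x_new[i]
                  for i in range(n)]
    alpha = [local_alpha_update(alpha[i], x_new[i],
                                [shared_new[j] for j in neighbors[i]], c) for i in range(n)]
    x = x_new

avg_err = np.mean([np.linalg.norm(x[i] - x_ref) for i in range(n)])
print(f"average distance to the reference solution: {avg_err:.3f}")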

Figure 1: Performance comparison with different Byzantine attack intensity.

Figure 1 shows the distance of the current update to the minimizer versus the number of iterations. We can see that if there are no Byzantine nodes in the network, the conventional ADMM converges quickly to the minimizer. However, in the presence of Byzantines, with and , the performance of the conventional ADMM degrades significantly: the algorithm approaches a neighborhood of the minimizer but cannot converge to the minimizer itself. The radius of the neighborhood depends on the strength () of the Byzantine attacks. With the parameters and , our proposed robust ADMM algorithm perfectly identifies the Byzantine nodes and achieves a convergence speed comparable to the case where there are no Byzantines. We found that, within iterations, our Byzantine profiling scheme identifies all the Byzantine nodes, and the algorithm then starts to converge.

Figure 2: Performance comparison with different choices of algorithm parameter.

Next, we employ the derived optimal choice of the algorithm parameter and show the performance comparison. The optimal , which is termed , is given in Theorem 3. We compare the performance of the robust algorithm in the cases where and . We can see clearly from Figure 2 that, with the optimal , the robust algorithm achieves a much faster convergence speed. Even though the optimal algorithm parameter is derived for the situation where there are Byzantine nodes, the conventional ADMM also obtains an acceleration with the optimal .

6 Conclusion

We considered the problem of federated learning using ADMM in the presence of data falsification. We studied the convergence behavior of the decentralized ADMM algorithm and showed that ADMM converges linearly to a neighborhood of the solution under certain conditions. We provided guidelines for network structure design to achieve faster convergence. We also gave several conditions on the errors that yield exact convergence to the solution. We proposed a robust ADMM scheme to enable federated learning in the presence of data falsifying Byzantines, and gave simulation results to validate the analysis and show the effectiveness of the proposed robust scheme. We assumed strong convexity of the cost function; one might follow our line of analysis to treat general convex functions.

Acknowledgements

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This work was funded under LLNL Laboratory Directed Research and Development project 17-ERD-101.

References

  • [1] A. Bnouhachem, H. Benazza, and M. Khalfaoui. An inexact alternating direction method for solving a class of structured variational inequalities. Applied Mathematics and Computation, 219(14):7837–7846, 2013.
  • [2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • [3] T.-H. Chang, M. Hong, and X. Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497, 2015.
  • [4] J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, March 2012.
  • [5] G. Gu, B. He, and J. Yang. Inexact alternating-direction-based contraction methods for separable linearly constrained convex optimization. Journal of Optimization Theory and Applications, 163(1):105–129, 2014.
  • [6] B. He, L.-Z. Liao, D. Han, and H. Yang. A new inexact alternating directions method for monotone variational inequalities. Mathematical Programming, 92(1):103–118, 2002.
  • [7] M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1):165–199, Mar 2017.
  • [8] B. Kailkhura, S. Brahma, and P. K. Varshney. Data falsification attacks on consensus-based detection systems. IEEE Transactions on Signal and Information Processing over Networks, 3(1):145–158, March 2017.
  • [9] J. Konečnỳ, B. McMahan, and D. Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.
  • [10] L. Lamport, R. Shostak, and M. Pease. The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382–401, July 1982.
  • [11] L. Majzoobi and F. Lahouti. Analysis of distributed admm algorithm for consensus optimization in presence of error. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4831–4835, March 2016.
  • [12] A. Makhdoumi and A. Ozdaglar. Convergence rate of distributed ADMM over networks. IEEE Transactions on Automatic Control, 62(10):5082–5095, Oct 2017.
  • [13] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  • [14] M. K. Ng, F. Wang, and X. Yuan. Inexact alternating direction methods for image recovery. SIAM Journal on Scientific Computing, 33(4):1643–1668, 2011.
  • [15] J. B. Predd, S. R. Kulkarni, and H. V. Poor. A collaborative training algorithm for distributed learning. IEEE Transactions on Information Theory, 55(4):1856–1871, April 2009.
  • [16] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Processing, 62(7):1750–1761, 2014.
  • [17] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein. Training neural networks without gradients: A scalable admm approach. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 2722–2731. JMLR.org, 2016.
  • [18] Y.-H. Xiao and H.-N. Song. An inexact alternating directions algorithm for constrained total variation regularized compressive sensing problems. Journal of Mathematical Imaging and Vision, 44(2):114–127, 2012.
  • [19] Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with Spectral Penalty Parameter Selection. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 718–727, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
  • [20] X.-M. Yuan. The improvement with relative errors of He et al.’s inexact alternating direction method for monotone variational inequalities. Mathematical and computer modelling, 42(11-12):1225–1236, 2005.

Supplementary Materials

Lemma 1.

The update of the algorithm can be written as

(32)
Proof.

Using the second step of the algorithm, we can write

(33)

and

(34)

Summing and telescoping from iteration 0 to using (34), we obtain the following by assuming

(35)

Substituting the above result into the first step of the algorithm yields

(36)

which completes the proof. ∎

Lemma 2.

The sequences satisfy

(37)
Proof.

Based on Lemma 1 and the fact , we can write

(38)

Subtracting from both sides of the above equation provides

(39)

Rearranging, we have the desired result. ∎

Lemma 3.

The null space of is span.

Proof.

Note that the null spaces of and are the same. By definition, and . Recall that if and is the th block of , then the th block of and the th block of are identity matrices ; otherwise the corresponding blocks are zero matrices . Therefore, is a matrix in which each row has one “1”, one “-1”, and zeros elsewhere, which means , i.e., null=span.

Note that and , thus null=null, completing the proof. ∎

Lemma 4.

For some that satisfies and belongs to the column space of , the sequences satisfy

(40)
Proof.

Using Lemma 2, we have

(41)

According to Lemma 3, null is span. Since , can be written as a linear combination of column vectors of . Therefore, there exists such that . Let be the projection of onto to obtain where lies in the column space of .

Hence, we can write

(42)

Lemma 5.

Proof.

Since the optimal consensus solution has an identical value for all its entries, lies in the space spanned by . Thus, according to Lemma 3, we have the desired result, and also

Appendix A Proof of Theorem 1

Proof.
(43)
(44)
(45)
(46)
(47)
(48)
(49)
(50)
(51)
(52)
(53)

For any , using the basic inequality

(54)

we can write for and

(55)
(56)
(57)
(58)
(59)

Thus, for a positive quantity ,

(60)
(61)

Since , for any , we can get

(62)

Therefore, the addition of (60) and (62) yields