Asynchronous decentralized accelerated stochastic gradient descent
In this work, we introduce an asynchronous decentralized accelerated stochastic gradient descent type of method for decentralized stochastic optimization, motivated by the fact that communication and synchronization are the major bottlenecks. We establish (resp., ) communication complexity and (resp., ) sampling complexity for solving general convex (resp., strongly convex) problems.
In this paper, we consider the following decentralized optimization problem which is cooperatively solved by agents distributed over the network:
Here is a general convex objective function only known to agent and satisfying
for some and , where denotes the subdifferential of at , and is a closed convex constraint set of agent . (2) is a unified way of describing a wide range of problems. In particular, if is a general Lipschitz continuous function with constant , then (2) holds with and . If is a smooth and strongly convex function in (see [21, Section 1.2.2] for definition), (2) is satisfied with . Clearly, relation (2) also holds if is given as the summation of smooth and nonsmooth convex functions. Throughout the paper, we assume the feasible set is nonempty.
Decentralized optimization problems defined over complex multi-agent networks are ubiquitous in signal processing, machine learning, control, and other areas in science and engineering (see e.g. [23, 13, 24, 8]). One critical issue in decentralized optimization is that synchrony among network agents is usually inefficient or impractical due to processing and communication delays and the absence of a master server in the network. Note that and are private and only known to agent , and all agents intend to cooperatively minimize the system objective as the sum of all local objective ’s in the absence of full knowledge about the global problem and network structure. Decentralized algorithms, therefore, require agents to communicate with their neighboring agents iteratively to propagate the distributed information in the network. Under the synchronous setting, all agents must wait for the slowest agent and/or the slowest communication channel/edge in the network, and a global coordinator must be present for synchronization, which can be extremely expensive in large-scale decentralized networks.
Following the seminal work , extensive research has been conducted in recent years to design asynchronous algorithmic schemes for decentralized optimization. An asynchronous gossip-based method under the edge-based random activation setting was proposed by  to solve averaging consensus problems. Later  extended this framework for solving (1) and established almost sure convergence to the optimal solution when is smooth and convex. Most recently,  also achieved almost sure convergence by iteratively activating a subset of agents. Besides (sub)gradient based methods, another well-known approach relies on solving the saddle point formulation of (1) (see Section 2 for the reformulation), where at each iteration a pair of primal and dual variables is updated alternately. The distributed ADMM (e.g., [12, 28, 31, 2]) has been studied in different asynchronous settings. More specifically, [12, 2] randomly selected and updated a subset of agents iteratively, with  assuming that is a simple convex function and  establishing almost sure convergence for smooth convex objectives.  employed the node-based random activation and achieved the rate of convergence when is a simple convex function, and  later established the same rate of convergence by activating one agent per iteration. Most recently,  proposed an asynchronous parallel primal-dual type method and established almost sure convergence when is smooth and convex.
The asynchronous decentralized algorithms discussed above require the knowledge of exact (sub)gradients (or function values) of . However, this requirement is not realistic when dealing with minimization of generalized risk and online (streaming) data distributed over a network. There exists limited research on asynchronous decentralized stochastic optimization (e.g., [20, 27, 6]), for which only noisy gradient information of functions , , can be easily computed. While asynchronous decentralized stochastic first-order methods [20, 27] established error bounds when is (strongly) convex,  achieved rate of convergence for smooth and convex problems.
Recently  proposed a class of primal-dual type communication-efficient methods for decentralized stochastic optimization, which obtained the best-known (resp., ) communication complexity and the optimal (resp., ) sampling complexity for solving nonsmooth convex (resp., strongly convex) problems under the synchronous setting. This class of communication-efficient methods requires two rounds of communication involving all network agents per iteration, and hence may incur huge synchronous delays. Moreover, it was designed for decentralized nonsmooth problems, so its convergence properties are unclear when it is applied to decentralized problems satisfying (2). Inspired by , we aim to propose an asynchronous decentralized algorithmic framework that solves (1) under the more general setting (2) while maintaining the complexity bounds achieved in . Our main contributions in this paper can be summarized as follows. Firstly, we introduce a doubly randomized primal-dual method, namely, the asynchronous decentralized primal-dual (ADPD) method, which randomly activates two agents per iteration, and hence only two rounds of communication between the activated agents and their neighboring agents are performed. This proposed method can find a stochastic -optimal solution in terms of both the primal optimality gap and feasibility residual in communication rounds when the objective functions are simple convex functions such that the local proximal subproblems can be solved exactly.
Secondly, we present a new asynchronous stochastic decentralized primal-dual type method, called the asynchronous accelerated stochastic decentralized communication sliding (AA-SDCS) method, for solving decentralized stochastic optimization problems. It should be pointed out that AA-SDCS is a unified algorithm that can be applied to solve a wide range of problems under the general setting of (2). In particular, only (resp., ) communication rounds are required while agents perform a total of (resp., ) stochastic (sub)gradient evaluations for general convex (resp., strongly convex) functions. Moreover, the latter bounds, a.k.a. sampling complexities, of AA-SDCS can achieve a better dependence on the Lipschitz constant when the objective function contains a smooth component, i.e., in (2), than other existing decentralized stochastic first-order methods. Since it only requires access to stochastic (sub)gradients at each iteration, AA-SDCS is particularly efficient for solving problems with , which provides a communication-efficient way to deal with streaming data and decentralized machine learning. We summarize the communication and sampling complexities achieved in this paper in Table 1.
|Problem type|Communication complexity (our results)|Communication complexity (existing results)|Sampling complexity (our results)|Sampling complexity (existing results)|
Thirdly, we demonstrate the advantages of the proposed methods through preliminary numerical experiments for solving decentralized support vector machine (SVM) problems with real data sets. For all testing problems, AA-SDCS can significantly save CPU running time over existing state-of-the-art decentralized methods.
To the best of our knowledge, this is the first time that these asynchronous communication sliding algorithms, and the aforementioned separate complexity bounds on communication rounds and stochastic (sub)gradient evaluations under the asynchronous setting are presented in the literature.
This paper is organized as follows. In Section 2, we introduce the problem formulation and provide some preliminaries on distance generating functions and prox-functions. We present our main asynchronous decentralized primal-dual framework and establish its convergence properties in Section 3. Section 4 is devoted to some preliminary numerical results that demonstrate the advantages of our proposed algorithms. The proofs of the main theorems in Section 3 are provided in Appendix A.
Notation and terminology. We denote by and the vectors of all zeros and all ones, whose dimensions are clear from the context. The cardinality of a set is denoted by . We use to denote the identity matrix in . We use for matrices and to denote their Kronecker product of size . For a matrix , we use to denote the entry in the -th row and -th column. For any , the set of integers is denoted by .
2 Problem setup
Consider a multi-agent network system whose communication is governed by an undirected graph , where indexes the set of agents, and represents the pairs of communicating agents. If there exists an edge from agent to , denoted by , agent may exchange information with agent . Therefore, each agent can directly receive (resp., send) information only from (resp., to) the agents in its neighborhood , where we assume that there always exists a self-loop for all agents , with no communication delay. The associated Laplacian of is defined as
We introduce an individual copy of the decision variable for each agent . Hence, by employing the Laplacian matrix , (1) can be written compactly as
where , , , and . The constraint is a compact way of writing for all pairs . In view of Theorem 4.2.12 in , is symmetric positive semidefinite and its null space coincides with the “agreement” subspace, i.e., . To ensure that each agent can obtain information from every other agent, we make the following blanket assumption throughout the paper.
Assumption 1. The graph is connected.
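To make the role of the Laplacian concrete, the following minimal Python sketch builds the Laplacian of a small undirected graph, checks connectivity, and evaluates the blocks of the Kronecker-structured consensus constraint. It assumes the standard graph Laplacian (degree on the diagonal, -1 for each pair of distinct neighbors, self-loops ignored); the function names `laplacian`, `is_connected` and `disagreement` are hypothetical and not part of the paper.

```python
from collections import deque


def laplacian(m, edges):
    """Standard graph Laplacian of an undirected graph on agents 0..m-1.

    Assumption: degree on the diagonal, -1 for neighbors, self-loops ignored.
    """
    L = [[0] * m for _ in range(m)]
    for i, j in edges:
        if i != j:
            L[i][j] = L[j][i] = -1
    for i in range(m):
        L[i][i] = -sum(L[i])  # diagonal is still 0 here, so this is the degree
    return L


def is_connected(m, edges):
    """Breadth-first search from agent 0; True if every agent is reached."""
    adj = {i: set() for i in range(m)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, queue = {0}, deque([0])
    while queue:
        i = queue.popleft()
        for j in adj[i] - seen:
            seen.add(j)
            queue.append(j)
    return len(seen) == m


def disagreement(L, x):
    """Blocks of (L kron I_d) x, where x[i] is the local copy held by agent i."""
    m, d = len(L), len(x[0])
    return [[sum(L[i][j] * x[j][k] for j in range(m)) for k in range(d)]
            for i in range(m)]


# A path graph on four agents, each holding a two-dimensional local copy.
edges = [(0, 1), (1, 2), (2, 3)]
L = laplacian(4, edges)
consensus = [[1.0, -2.0] for _ in range(4)]
print(is_connected(4, edges), disagreement(L, consensus))
```

When all local copies agree, every block of the product vanishes, which is the "agreement" subspace mentioned above; on a connected graph these are the only vectors for which this happens.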
Under Assumption 1, problem (1) and (4) are equivalent. We next consider a reformulation of (4). By the method of Lagrange multipliers, problem (4) is equivalent to the following saddle point problem:
where are the Lagrange multipliers associated with the constraints . We assume that there exists an optimal solution of (4) and that there exists such that is a saddle point of (5). Finally, we define the following terminology.
A point is called a stochastic -solution of (4) if
We say that has primal residual and feasibility residual .
Note that for problem (4), the feasibility residual measures the disagreement among the local copies , for . We will use these two criteria to evaluate the output solutions of the algorithms proposed in this paper.
2.1 Distance generating function and prox-function
Prox-function, also known as proximity control function or Bregman distance function , has played an important role as a substantial generalization of the Euclidean projection, since it can be flexibly tailored to the geometry of a constraint set .
For any convex set equipped with an arbitrary norm , we say that a function is a distance generating function with modulus with respect to , if is continuously differentiable and strongly convex with modulus with respect to , i.e., The prox-function induced by is given by
We now assume that the constraint set for each agent in (1) is equipped with norm , and its associated prox-function is given by . It then follows from the strong convexity of that
Throughout the paper, we endow the dual space where the multipliers of (5) reside with the standard Euclidean norm , since the feasible region of is unbounded. For simplicity, we often write instead of for a dual multiplier .
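As an illustration of how a prox-function adapts to the geometry of the constraint set, the sketch below evaluates a Bregman distance for two common distance generating functions. It assumes the usual convention V(x, y) = ω(y) − ω(x) − ⟨∇ω(x), y − x⟩; the helper names (`bregman`, `sq_norm`, `neg_entropy`) are hypothetical.

```python
import math


def bregman(omega, grad_omega, x, y):
    """V(x, y) = omega(y) - omega(x) - <grad omega(x), y - x> (assumed convention)."""
    inner = sum(g * (yi - xi) for g, xi, yi in zip(grad_omega(x), x, y))
    return omega(y) - omega(x) - inner


def sq_norm(x):
    """Euclidean distance generating function, 0.5 * ||x||_2^2."""
    return 0.5 * sum(v * v for v in x)


def sq_norm_grad(x):
    return list(x)


def neg_entropy(x):
    """Negative entropy, a common choice on the probability simplex."""
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


x = [0.2, 0.3, 0.5]
y = [0.5, 0.25, 0.25]
v_euclid = bregman(sq_norm, sq_norm_grad, x, y)            # 0.5 * ||x - y||_2^2
v_entropy = bregman(neg_entropy, neg_entropy_grad, x, y)   # KL divergence of y from x
print(v_euclid, v_entropy)
```

With the Euclidean choice the prox-function reduces to half the squared distance, while the entropy choice yields the Kullback-Leibler divergence on the simplex, which is strongly convex with modulus 1 with respect to the ℓ1 norm.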
3 The algorithms
In this section, we introduce an asynchronous decentralized primal-dual framework for solving (1) in the decentralized setting. Specifically, two asynchronous methods are presented, namely asynchronous decentralized primal-dual method in Subsection 3.1 and asynchronous accelerated stochastic decentralized communication sliding in Subsection 3.2, respectively. Moreover, we establish complexity bounds (number of inter-node communication rounds and/or intra-node stochastic (sub)gradient evaluations) separately in terms of primal functional optimality gap and constraint (or consistency) violation for solving (1)-(4).
3.1 Asynchronous decentralized primal-dual method
Our main goals in this subsection are to introduce the basic scheme of the asynchronous decentralized primal-dual (ADPD) method and to establish its complexity results. Throughout this subsection, we assume that is a simple function such that we can solve the primal subproblem (15) explicitly.
We formally present the ADPD method in Algorithm 1. Each agent maintains two local sequences, namely, the primal estimates and the dual variables . All primal estimates and are locally initialized from some arbitrary point in , and each dual variable . At each iteration , only one randomly selected agent (the activated agent) updates its dual variable , and then one randomly selected agent updates its primal variable . In particular, each agent in the activated agent’s neighborhood, i.e., agents , computes a local prediction using the two previous primal estimates (cf. (10)), and sends it to agent . In (11)-(12), the activated agent calculates its neighborhood disagreement using the received messages, and updates the dual variable . Other agents’ dual variables remain unchanged. Then, another round of communication (14) between the activated agent and its neighboring agents occurs after the dual prediction step (13). Lastly, the activated agent solves the proximal projection subproblem (15) to update , and other agents’ primal estimates remain the same as in the previous iteration.
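The communication pattern described above can be sketched in a few lines of Python. This is a simplified illustration and not Algorithm 1 itself: it assumes scalar decision variables, quadratic local objectives f_i(x) = ½(x − b_i)², a Euclidean prox-function, constant parameters, and it omits the dual prediction step (13). The names `adpd_like_step`, `alpha`, `tau`, `eta` and `b` are hypothetical.

```python
import random


def adpd_like_step(x_prev, x_prev2, y, neighbors, b, alpha, tau, eta, rng):
    """One simplified node-activated primal-dual iteration (illustrative only).

    neighbors[i] is the set of agents adjacent to agent i (excluding i itself).
    """
    m = len(x_prev)

    # Dual update: a single randomly activated agent i.
    i = rng.randrange(m)
    # Agent i and its neighbors extrapolate from their two previous primal estimates.
    x_tilde = {j: x_prev[j] + alpha * (x_prev[j] - x_prev2[j])
               for j in neighbors[i] | {i}}
    # Neighborhood disagreement: the i-th entry of the Laplacian applied to x_tilde.
    v = len(neighbors[i]) * x_tilde[i] - sum(x_tilde[j] for j in neighbors[i])
    y_new = list(y)
    y_new[i] = y[i] + v / tau  # all other dual variables stay unchanged

    # Primal update: a single randomly activated agent p.
    p = rng.randrange(m)
    # Second communication round: agent p collects dual variables from its neighbors.
    w = len(neighbors[p]) * y_new[p] - sum(y_new[j] for j in neighbors[p])
    # Closed-form minimizer of w*x + 0.5*(x - b_p)^2 + (eta/2)*(x - x_prev_p)^2.
    x_new = list(x_prev)
    x_new[p] = (b[p] - w + eta * x_prev[p]) / (1.0 + eta)
    return x_new, y_new, i, p


# Path graph on three agents, all variables initialized at zero.
neighbors = [{1}, {0, 2}, {1}]
b = [1.0, 2.0, 6.0]
x0 = [0.0, 0.0, 0.0]
y0 = [0.0, 0.0, 0.0]
x1, y1, i, p = adpd_like_step(x0, x0, y0, neighbors, b, 1.0, 2.0, 1.0,
                              random.Random(0))
```

Only the entries indexed by the two activated agents can change in one call, so each iteration touches two neighborhoods rather than the whole network.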
It should be emphasized that each iteration only involves two communication rounds (cf. (11) and (14)) between the activated agents and their neighboring agents, which significantly reduces the synchronous delays appearing in many decentralized methods (e.g., [7, 25, 26, 15]), since these methods require at least one communication round between all agents and their neighboring agents iteratively. Also note that, similar to the asynchronous ADMM proposed in , ADPD employs node-based activation. However, while  requires all agents to update dual variables iteratively based on the information obtained from communication, in ADPD only the activated agent needs to collect neighboring information and update its dual variable (see (11) and (12)), and hence ADPD further reduces communication costs and synchronous delays compared to . Moreover, ADPD can achieve the same rate of convergence as  under the assumption that (15) can be solved explicitly. We will demonstrate later that by exploiting the strong convexity, an improved rate of convergence can be obtained.
In the following theorem, we provide a specific selection of , and , which leads to complexity bounds for the functional optimality gap and also the feasibility residual to obtain a stochastic -solution of (4).
Observe that in Algorithm 1, we assume that ’s are simple functions such that (15) can be solved explicitly. However, since ’s are possibly nonsmooth functions and/or possess composite structures, it is often difficult to solve (15), especially when is provided in the form of an expectation. In the next subsection, we present a new asynchronous stochastic decentralized primal-dual type method, called the asynchronous accelerated stochastic decentralized communication sliding (AA-SDCS) method, for the case when (15) is not easy to solve.
3.2 Asynchronous accelerated stochastic decentralized communication sliding
In this subsection, we show that one can still maintain the same number of inter-node communications even when the subproblem (15) is approximately solved through an optimal stochastic approximation method, namely AC-SA proposed in [10, 9, 14], and that the total number of required stochastic (sub)gradient evaluations (or sampling complexity) is comparable to that of centralized mirror descent methods. Throughout this subsection, we assume that only noisy (sub)gradient information of , , is available or easier to compute. This situation arises when the functions ’s are given either in the form of an expectation or as the summation of many components. Moreover, we assume that the first-order information of the function , , can be accessed by a stochastic oracle (SO), which, given a point , outputs a vector such that
where is a random vector which models a source of uncertainty and is independent of the search point , and the distribution is not known in advance. We call a stochastic (sub)gradient of at . Observe that this assumption covers the case that one can access the exact (sub)gradients of whenever .
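As a concrete instance of such an oracle, the sketch below draws one data point uniformly at random and returns a subgradient of its hinge loss, which is an unbiased estimate of a subgradient of the empirical average. The hinge loss is chosen here only because it is a familiar nonsmooth example; the data set and the function names (`hinge_subgradient`, `stochastic_oracle`, `full_subgradient`) are hypothetical.

```python
import random


def hinge_subgradient(x, label, feature):
    """A subgradient of x -> max(0, 1 - label * <x, feature>)."""
    margin = label * sum(xi * ui for xi, ui in zip(x, feature))
    if margin < 1.0:
        return [-label * ui for ui in feature]
    return [0.0] * len(x)


def stochastic_oracle(x, data, rng):
    """Sample one (label, feature) pair uniformly and return its subgradient."""
    label, feature = data[rng.randrange(len(data))]
    return hinge_subgradient(x, label, feature)


def full_subgradient(x, data):
    """Average of the per-sample subgradients, i.e. the oracle's expectation."""
    total = [0.0] * len(x)
    for label, feature in data:
        g = hinge_subgradient(x, label, feature)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / len(data) for t in total]


# Toy data: (label, feature vector) pairs.
data = [(1, [1.0, 0.0]), (-1, [0.0, 1.0]), (1, [2.0, 1.0]), (-1, [1.0, -1.0])]
x = [0.0, 0.0]
rng = random.Random(1)
draws = [stochastic_oracle(x, data, rng) for _ in range(20000)]
empirical_mean = [sum(g[k] for g in draws) / len(draws) for k in range(2)]
print(full_subgradient(x, data), empirical_mean)
```

Because the data set is finite, the oracle's variance is bounded as well; when the data set contains a single point, the oracle returns an exact subgradient.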
In order to exploit the strong convexity of the prox-function , we assume in this subsection that each prox-function (cf. (7)) grows quadratically with the quadratic growth constant , i.e., there exists a constant such that
By (8), we must have .
We now add a few comments about Algorithm 2. Firstly, similar to SDCS proposed in , AA-SDCS exploits two loops: the doubly randomized primal-dual scheme as the outer loop and the ACS procedure as the inner loop. More specifically, AA-SDCS utilizes the AC-SA method proposed in [10, 9, 14] to approximately solve the primal subproblem in (15), which provides a unified scheme for solving the general class of problems defined in (2) and leads to an accelerated rate of convergence when possesses smooth structure. Secondly, the same dual information (see (26)) is used throughout the iterations of the ACS procedure, and hence no additional communication is required within the procedure. Finally, since AA-SDCS randomly selects one subproblem (15) and solves it inexactly, the outer loop also needs to be carefully designed to attain the best possible rate of convergence. In fact, the ACS procedure provides two approximate solutions of (15): one is the primal estimate and the other is , which will be maintained by each agent and later play a crucial role in the development and convergence analysis of AA-SDCS. We also accordingly modify the primal extrapolation step of the outer loop (cf. (22)). For later convenience, we refer to the subproblem solved by ACS at iteration as , i.e.,
Theorem 2 provides a specific selection of , , and for Algorithm 2, and and for the ACS procedure, which leads to complexity bounds for the functional optimality gap and also the feasibility residual to obtain a stochastic -solution of (4).
In view of Theorem 2, letting , we can see that the total number of inter-node communication rounds and intra-node (sub)gradient evaluations required by AA-SDCS for finding a stochastic -solution of (4) can be bounded by
respectively. It also needs to be emphasized that the sampling complexity (second bound in (36)) only sublinearly depends on the Lipschitz constant .
Now consider the case when ’s are strongly convex (i.e., in (2)). The following theorem instantiates Algorithm 2 by providing a selection of , , and , which leads to an improved complexity bound for the functional optimality gap and also the feasibility residual to obtain a stochastic -solution of (4).
4 Numerical experiments
We demonstrate the advantages of our proposed AA-SDCS method over the state-of-the-art synchronous algorithm, the stochastic decentralized communication sliding (SDCS) method proposed in , through some preliminary numerical experiments.
Let us consider the decentralized linear Support Vector Machines (SVM) model with the following hinge loss function
where is the pair of class label and feature vector, and denotes the weight vector. We consider two types of stochastic decentralized linear SVM problems in this paper. For the convex case, we study the -norm SVM problem [32, 4] defined in (41), while for the strongly convex case, we study the -norm SVM model defined in (42). Moreover, we use the Erdős-Rényi algorithm to generate the underlying decentralized network (we implemented the Erdős-Rényi algorithm based on a MATLAB function written by Pablo Blider, which can be found at https://www.mathworks.com/matlabcentral/fileexchange/4206). Note that nodes with different degrees are drawn in different colors (cf. Figure 1). We also used the real dataset named “ijcnn1” from LIBSVM (this dataset can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) and drew samples from this dataset as our problem instance data to train the decentralized linear SVM model. These samples are evenly split over the network agents. For example, if we have nodes (or agents) in the decentralized network (see Figure 1), each network agent has samples.
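The experimental setup above can be mimicked with the following Python sketch, which samples a connected Erdős-Rényi graph and splits a list of samples evenly across the agents. It is an illustration under stated assumptions rather than the MATLAB code used in the experiments: the number of agents, the edge probability, the resampling until the graph is connected, and all function names are hypothetical choices.

```python
import random
from collections import deque


def erdos_renyi(m, p, rng):
    """Sample an undirected G(m, p) graph as a list of neighbor sets."""
    nbrs = [set() for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            if rng.random() < p:  # each pair is linked independently
                nbrs[i].add(j)
                nbrs[j].add(i)
    return nbrs


def graph_is_connected(nbrs):
    """Breadth-first search from node 0."""
    seen, queue = {0}, deque([0])
    while queue:
        i = queue.popleft()
        for j in nbrs[i] - seen:
            seen.add(j)
            queue.append(j)
    return len(seen) == len(nbrs)


def connected_erdos_renyi(m, p, rng, max_tries=1000):
    """Resample until the graph is connected (an assumption of this sketch)."""
    for _ in range(max_tries):
        nbrs = erdos_renyi(m, p, rng)
        if graph_is_connected(nbrs):
            return nbrs
    raise RuntimeError("no connected graph found; try a larger p")


def split_evenly(samples, m):
    """Give each of the m agents the same number of samples; any remainder is dropped."""
    per_agent = len(samples) // m
    return [samples[a * per_agent:(a + 1) * per_agent] for a in range(m)]


rng = random.Random(0)
network = connected_erdos_renyi(8, 0.5, rng)
local_data = split_evenly(list(range(20)), 4)
print([len(n) for n in network], [len(d) for d in local_data])
```

The node degrees printed at the end correspond to the quantity used to color the nodes in the network plot.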
With the same initial points and , we compare the performance of our algorithms with the SDCS method  for solving (1)-(4) by reporting the progress of objective function values and feasibility residuals versus the elapsed CPU running time (in seconds) for solving the aforementioned two different types of problems. In all problem instances, we use norm in both the primal and dual spaces, and hence in the parameter settings of SDCS refers to the maximum eigenvalue of the Laplacian matrix . Moreover, all algorithms are implemented in MATLAB R2016a and run in the computer environment of with -core (Intel(R) Xeon(R) CPU E5-2673 v3 GHz) virtual machine on Microsoft Azure. Since the underlying network has agents, we utilized the parallel toolbox in MATLAB to simulate the synchronous setting for SDCS. However, inter-node communication is instant and no delay is simulated in any experiment. In fact, such a simulation setup favors the synchronous methods, since these methods can be heavily slowed down by different processing speeds of the agents (cores) and inter-node communication speeds.
where represents a uniform random variable with support and denotes the dataset belonging to node . We compare the performance of AA-SDCS with SDCS for the decentralized network setups (cf. Figure 1). For all problem instances, we choose the parameters of AA-SDCS as in Theorem 2. For SDCS, we choose parameters as suggested in .
In Figure 2, the vertical axis of the left subgraph represents the objective function values, the vertical axis of the right subgraph represents the feasibility measure , and the horizontal axis is the elapsed CPU running time in seconds. These numerical results are consistent with our theoretical analysis. We also emphasize that AA-SDCS can significantly save CPU running time over SDCS in terms of both objective function values and feasibility residuals, as shown in Figure 2, even when each agent (core) has the same processing speed.
Strongly convex case: decentralized -norm SVM
Consider a decentralized linear SVM problem with regularizer defined over the -agent decentralized network as follows:
We compare the performance of AA-SDCS with SDCS for the decentralized network setups (cf. Figure 1). For all problem instances, we choose the parameters of AA-SDCS as in Theorem 3. For SDCS, we choose parameters as suggested in .
The above figures clearly show that AA-SDCS can significantly save CPU running time over SDCS in terms of both objective function values and feasibility residuals. Moreover, comparing Figure 3 with Figure 2, we can see that AA-SDCS obtains larger improvements over SDCS for solving decentralized -norm SVM problems than for decentralized -norm SVM problems. In fact, the decentralized -norm SVM problem defined in (42) has a composite objective structure that consists of a nonsmooth hinge loss function and a smooth strongly convex -regularizer, and the convergence results of AA-SDCS have a better dependence on the Lipschitz constant , which indicates that it can obtain a faster convergence speed than SDCS for solving decentralized -norm SVM problems.
References
[1] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, Belmont, 1997.
[2] Pascal Bianchi, Walid Hachem, and Franck Iutzeler. A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Transactions on Automatic Control, 61(10):2947–2957, 2016.
[3] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Trans. Inform. Theory, 52(6):2508–2530, June 2006.
[4] Paul S. Bradley and Olvi L. Mangasarian. Feature selection via concave minimization and support vector machines. In ICML, volume 98, pages 82–90, 1998.
[5] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
[6] T. Chang and M. Hong. Stochastic proximal gradient consensus over random networks. http://arxiv.org/abs/1511.08905, 2015.
[7] J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, March 2012.
[8] J. W. Durham, A. Franchi, and F. Bullo. Distributed pursuit-evasion without mapping or global localization via local frontiers. Autonomous Robots, 32(1):81–95, 2012.
[9] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
[10] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
[11] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, New York, 1991.
[12] F. Iutzeler, P. Bianchi, P. Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. http://arxiv.org/pdf/1303.2837, 2013.
[13] A. Jadbabaie, Jie Lin, and A. S. Morse. Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Transactions on Automatic Control, 48(6):988–1001, June 2003.
[14] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
[15] Guanghui Lan, Soomin Lee, and Yi Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961, 2017.
[16] S. Lee and A. Nedić. Gossip-based random projection algorithm. In Proceedings of the 46th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, November 2012.
[17] R. D. C. Monteiro and B. F. Svaiter. On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM Journal on Optimization, 20(6):2755–2787, 2010.
[18] R. D. C. Monteiro and B. F. Svaiter. Complexity of variants of Tseng’s modified F-B splitting and Korpelevich’s methods for hemivariational inequalities with applications to saddle-point and convex optimization problems. SIAM Journal on Optimization, 21(4):1688–1720, 2011.
[19] R. D. C. Monteiro and B. F. Svaiter. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475–507, 2013.
[20] A. Nedić. Asynchronous broadcast-based convex optimization over a network. IEEE Trans. Automat. Contr., 56(6):1337–1351, 2011.
[21] Y. E. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Massachusetts, 2004.
[22] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao Jr. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644–681, 2015.
[23] M. Rabbat and R. D. Nowak. Distributed optimization in sensor networks. In IPSN, pages 20–27, 2004.
[24] S. S. Ram, V. V. Veeravalli, and A. Nedić. Distributed non-autonomous power control through distributed convex optimization. In IEEE INFOCOM, pages 3001–3005, 2009.
[25] W. Shi, Q. Ling, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.
[26] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
[27] K. Srivastava and A. Nedić. Distributed asynchronous constrained stochastic optimization. IEEE Journal of Selected Topics in Signal Processing, 5(4):772–790, 2011.
[28] E. Wei and A. Ozdaglar. On the convergence of asynchronous distributed alternating direction method of multipliers. http://arxiv.org/pdf/1307.8254, 2013.
[29] Tianyu Wu, Kun Yuan, Qing Ling, Wotao Yin, and Ali H. Sayed. Decentralized consensus optimization with asynchrony and delays. In Signals, Systems and Computers, 2016 50th Asilomar Conference on, pages 992–996. IEEE, 2016.
[30] Jinming Xu, Shanying Zhu, Yeng Chai Soh, and Lihua Xie. Convergence of asynchronous distributed gradient methods over stochastic networks. IEEE Transactions on Automatic Control, 63(2):434–448, 2018.
[31] Guoqiang Zhang and Richard Heusdens. Bi-alternating direction method of multipliers over graphs. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 3571–3575. IEEE, 2015.
[32] Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J. Hastie. 1-norm support vector machines. In Advances in Neural Information Processing Systems, pages 49–56, 2004.
Appendix A Convergence analysis
A.1 Some basic tools: gap functions, termination criteria and technical results
Given a pair of feasible solutions and of (5), we define the primal-dual gap function by