Accelerated Gossip via Stochastic Heavy Ball Method
(Accepted for publication at the 56th Annual Allerton Conference on Communication, Control, and Computing. This work first appeared online on 9th July 2018.)
Abstract
In this paper we show how the stochastic heavy ball method (SHB)—a popular method for solving stochastic convex and nonconvex optimization problems—operates as a randomized gossip algorithm. In particular, we focus on two special cases of SHB: the Randomized Kaczmarz method with momentum and its block variant. Building upon a recent framework for the design and analysis of randomized gossip algorithms [20] we interpret the distributed nature of the proposed methods. We present novel protocols for solving the average consensus problem where in each step all nodes of the network update their values but only a subset of them exchange their private values. Numerical experiments on popular wireless sensor networks showing the benefits of our protocols are also presented.
Average Consensus Problem, Linear Systems, Networks, Randomized Gossip Algorithms, Randomized Kaczmarz, Momentum, Acceleration
I Introduction
Average consensus is a fundamental problem in distributed computing and multi-agent systems. It comes up in many real-world applications such as coordination of autonomous agents, estimation, rumour spreading in social networks, PageRank, distributed data fusion on ad hoc networks, and decentralized optimization. Due to its great importance there is much classical [35, 7] and recent [38, 37, 4] work on the design of efficient algorithms/protocols for solving it.
One of the most attractive classes of protocols for solving the average consensus problem are gossip algorithms. The development and design of gossip algorithms was studied extensively in the last decade. The seminal 2006 paper of Boyd et al. [4] on randomized gossip algorithms motivated a flurry of subsequent research, and gossip algorithms now appear in many applications, including distributed data fusion in sensor networks [38], load balancing [6] and clock synchronization [11]. For a survey of selected relevant work prior to 2010, we refer the reader to the work of Dimakis et al. [8]. For more recent results on randomized gossip algorithms we suggest [40, 17, 28, 20, 24, 1]. See also [9, 2, 29, 14].
The main goal in the design of gossip protocols is for the computation and communication to be done as quickly and efficiently as possible. In this work, our focus is precisely this. We design randomized gossip protocols which converge to consensus fast.
I-A The average consensus problem
In the average consensus (AC) problem we are given an undirected connected network $G=(V,E)$ with node set $V=\{1,2,\dots,n\}$ and edges $E$. Each node $i\in V$ “knows” a private value $c_i\in\mathbb{R}$. The goal of AC is for every node to compute the average of these private values, $\bar{c} := \frac{1}{n}\sum_{i=1}^{n} c_i$, in a distributed fashion. That is, the exchange of information can only occur between connected nodes (neighbors).
I-B Main Contributions
We present a new class of randomized gossip protocols where in each iteration all nodes of the network update their values but only a subset of them exchange their private information. Our protocols are based on recently proposed ideas for the acceleration of randomized Kaczmarz methods for solving consistent linear systems [22], where the addition of a momentum term was shown to provide practical speedups over the vanilla Kaczmarz methods. Further, we explain the connection between gossip algorithms for solving the average consensus problem, Kaczmarz-type methods for solving consistent linear systems, and stochastic gradient descent and stochastic heavy ball methods for solving stochastic optimization problems. We show that essentially all these algorithms behave as gossip algorithms. Finally, we explain in detail the gossip nature of two recently proposed fast Kaczmarz-type methods: the randomized Kaczmarz with momentum (mRK), and its block variant, the randomized block Kaczmarz with momentum (mRBK). We present a detailed comparison of our proposed gossip protocols with existing popular randomized gossip protocols, and through numerical experiments we show the benefits of our methods.
I-C Structure of the paper
This work is organized as follows. Section II introduces the important technical preliminaries and the background necessary for understanding our methods. A new connection between gossip algorithms, Kaczmarz methods for solving linear systems and stochastic gradient descent (SGD) for solving stochastic optimization problems is also described. In Section III the two new accelerated gossip protocols are presented. Details of their behaviour and performance are also explained. Numerical evaluation of the new gossip protocols is presented in Section IV. Finally, concluding remarks are given in Section V.
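I-D Notation
The following notational conventions are used in this paper. We write $[n] := \{1,2,\dots,n\}$. Boldface upper-case letters denote matrices; $\mathbf{I}$ is the identity matrix. By $\mathcal{L}$ we denote the solution set of the linear system $\mathbf{A}x=b$, where $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $b\in\mathbb{R}^{m}$. Throughout the paper, $x^*$ is the projection of $x^0$ onto $\mathcal{L}$ (that is, $x^*$ is the solution of the best approximation problem; see equation (5)). An explicit formula for the projection of $x$ onto the set $\mathcal{L}$ is given by
$\Pi_{\mathcal{L}}(x) := \arg\min_{x'\in\mathcal{L}}\|x'-x\| = x-\mathbf{A}^\top(\mathbf{A}\mathbf{A}^\top)^\dagger(\mathbf{A}x-b).$
A matrix that often appears in our update rules is
(1) $\mathbf{H} := \mathbf{S}(\mathbf{S}^\top\mathbf{A}\mathbf{A}^\top\mathbf{S})^\dagger\mathbf{S}^\top,$
where $\mathbf{S}$ is a random matrix with $m$ rows drawn in each step of the proposed methods from a given distribution $\mathcal{D}$, and $\dagger$ denotes the Moore-Penrose pseudoinverse. Note that $\mathbf{H}$ is a random symmetric positive semidefinite matrix.
In the convergence analysis we use $\lambda_{\min}^+$ to indicate the smallest nonzero eigenvalue, and $\lambda_{\max}$ for the largest eigenvalue of the matrix $\mathbf{W} := \mathbb{E}[\mathbf{A}^\top\mathbf{H}\mathbf{A}]$, where the expectation is taken over $\mathbf{S}\sim\mathcal{D}$. Finally, $x^k=(x_1^k,\dots,x_n^k)\in\mathbb{R}^n$ represents the vector with the private values of the $n$ nodes of the network at the $k$-th iteration, while with $x_i^k$ we denote the value of node $i\in[n]$ at the $k$-th iteration.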
II Background-Technical Preliminaries
Our work is closely related to two recent papers. In [20], a new perspective on randomized gossip algorithms is presented. In particular, a new approach for the design and analysis of randomized gossip algorithms is proposed and it was shown how the Randomized Kaczmarz and Randomized Block Kaczmarz, popular methods for solving linear systems, work as gossip algorithms when applied to a special system encoding the underlying network. In [22], several classes of stochastic optimization algorithms enriched with heavy ball momentum were analyzed. Among the methods studied are: stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic dual subspace ascent.
In the rest of this section we present the main results of the above papers, highlighting several connections. These results will be later used for the development of the new randomized gossip protocols.
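II-A Kaczmarz Methods and Gossip Algorithms
Kaczmarz-type methods are very popular for solving linear systems $\mathbf{A}x=b$ with many equations. The (deterministic) Kaczmarz method for solving consistent linear systems was originally introduced by Kaczmarz in 1937 [15]. Despite the fact that a large volume of papers was written on the topic, the first provably linearly convergent variant of the Kaczmarz method, the randomized Kaczmarz method (RK), was developed more than 70 years later, by Strohmer and Vershynin [32]. This result sparked renewed interest in the design of randomized methods for solving linear systems [25, 26, 10, 23, 39, 27, 31, 18]. More recently, Gower and Richtárik [12] provided a unified analysis for several randomized iterative methods for solving linear systems using a sketch-and-project framework. We adopt this framework in this paper.
In particular, the sketch-and-project algorithm [12] for solving the consistent linear system $\mathbf{A}x=b$ has the form
(2) $x^{k+1} = x^k-\mathbf{A}^\top\mathbf{S}_k(\mathbf{S}_k^\top\mathbf{A}\mathbf{A}^\top\mathbf{S}_k)^\dagger\mathbf{S}_k^\top(\mathbf{A}x^k-b) = x^k-\mathbf{A}^\top\mathbf{H}_k(\mathbf{A}x^k-b),$
where in each iteration the matrix $\mathbf{S}_k$ is sampled afresh from an arbitrary distribution $\mathcal{D}$, and $\mathbf{H}_k := \mathbf{S}_k(\mathbf{S}_k^\top\mathbf{A}\mathbf{A}^\top\mathbf{S}_k)^\dagger\mathbf{S}_k^\top$ is the matrix from (1). In [12] it was shown that many popular algorithms for solving linear systems, including the RK method and the randomized coordinate descent method, can be cast as special cases of the above update by choosing an appropriate distribution $\mathcal{D}$. (Footnote 1: In order to recover a randomized coordinate descent method, one also needs to perform projections with respect to a more general Euclidean norm. However, for simplicity, in this work we only consider the standard Euclidean norm.) The special cases that we are interested in are the randomized Kaczmarz (RK) and its block variant, the randomized block Kaczmarz (RBK).
Let $e_i\in\mathbb{R}^m$ be the $i$-th unit coordinate vector in $\mathbb{R}^m$ and let $\mathbf{I}_{:C}$ be the column submatrix of the $m\times m$ identity matrix with columns indexed by $C\subseteq[m]$. Then the RK and RBK methods can be obtained as special cases of the update rule (2) as follows:
RK: Let $\mathbf{S}_k=e_i$, where $i$ is chosen at random from $[m]$. Writing $\mathbf{A}_{i:}$ for the $i$-th row of $\mathbf{A}$,
(3) $x^{k+1} = x^k-\frac{\mathbf{A}_{i:}x^k-b_i}{\|\mathbf{A}_{i:}\|^2}\mathbf{A}_{i:}^\top.$
RBK: Let $\mathbf{S}_k=\mathbf{I}_{:C}$, where $C\subseteq[m]$ is chosen at random. Writing $\mathbf{A}_{C:}$ for the row submatrix of $\mathbf{A}$ indexed by $C$,
(4) $x^{k+1} = x^k-\mathbf{A}_{C:}^\top(\mathbf{A}_{C:}\mathbf{A}_{C:}^\top)^\dagger(\mathbf{A}_{C:}x^k-b_C).$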
In this paper we are interested in two particular extensions of the above methods: the randomized Kaczmarz method with momentum (mRK) and its block variant, the randomized block Kaczmarz with momentum (mRBK), both proposed and analyzed in [22]. Before we describe these two algorithms, let us summarize the main connections between the Kaczmarz methods for solving linear systems and gossip algorithms, as presented in [20].
In [13, 30, 22], it was shown that even in the case of consistent linear systems with multiple solutions, Kaczmarz-type methods converge linearly to one particular solution: the projection of the initial iterate $x^0$ onto the solution set of the linear system. This naturally leads to the formulation of the best approximation problem:
(5) $\min_{x\in\mathbb{R}^n}\ \tfrac{1}{2}\|x-x^0\|^2 \quad \text{subject to} \quad \mathbf{A}x=b.$
Above, $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $\|\cdot\|$ is the standard Euclidean norm. By $x^*$ we denote the solution of (5).
In [20] it was shown how RK and RBK work as gossip algorithms when applied to a special linear system encoding the underlying network.
Definition II.1 ([20])
A linear system $\mathbf{A}x=b$ is called an “average consensus (AC) system” when $\mathbf{A}x=b$ is equivalent to saying that $x_i=x_j$ for all $(i,j)\in E$.
Note that many linear systems satisfy the above definition. For example, we can choose $b=0$ and $\mathbf{A}\in\mathbb{R}^{m\times n}$ to be the incidence matrix of $G$. In this case, the row of the system corresponding to edge $(i,j)$ directly encodes the constraint $x_i=x_j$. A different choice is to pick $b=0$ and $\mathbf{A}=\mathbf{L}$, where $\mathbf{L}$ is the Laplacian of $G$. Note that depending on which AC system is used, RK and RBK have different interpretations as gossip protocols.
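To make the first example concrete, the following minimal Python sketch builds an incidence matrix for a small graph and checks that the residual $\mathbf{A}x$ vanishes at a consensus vector and not elsewhere. The helper names `incidence_matrix` and `matvec`, the sign convention (+1 at one endpoint, -1 at the other) and the 4-node path are illustrative assumptions, not taken from [20].

```python
def incidence_matrix(n, edges):
    """Return the m x n incidence matrix (list of rows) of an undirected graph.

    Assumed convention: the row of edge (i, j) has +1 in column i and -1 in
    column j, so that (A x)_e = x_i - x_j.
    """
    rows = []
    for i, j in edges:
        row = [0.0] * n
        row[i], row[j] = 1.0, -1.0
        rows.append(row)
    return rows


def matvec(A, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


edges = [(0, 1), (1, 2), (2, 3)]        # a connected path on 4 nodes
A = incidence_matrix(4, edges)

consensus = [2.5, 2.5, 2.5, 2.5]        # all nodes agree
non_consensus = [1.0, 2.0, 3.0, 4.0]    # nodes disagree

residual_consensus = matvec(A, consensus)            # every entry is zero
residual_non_consensus = matvec(A, non_consensus)    # some entries are nonzero
```

Because the graph is connected, requiring every edge residual to vanish forces all node values to coincide, which is the property demanded by Definition II.1.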
From now on we work with the AC system described in the first example. Since $b=0$, the general sketch-and-project update rule (2) simplifies to:
(6) $x^{k+1} = \left[\mathbf{I}-\mathbf{A}^\top\mathbf{H}_k\mathbf{A}\right]x^k.$
The convergence performance of RK and RBK for solving the best approximation problem (and as a result the average consensus problem) is described by the following theorem.
In [20], the behavior of both RK and RBK as gossip algorithms was described, and a comparison with the convergence results of existing randomized gossip protocols was made. In particular, it was shown that the most basic randomized gossip algorithm [4] (“randomly pick an edge $(i,j)\in E$ and then replace the values stored at vertices $i$ and $j$ by their average”) is an instance of RK applied to the linear system $\mathbf{A}x=0$, where $\mathbf{A}$ is the incidence matrix of $G$. RBK can also be interpreted as a gossip algorithm:
Theorem II.3 ([20], RBK as a Gossip Algorithm)
Each iteration of RBK for solving $\mathbf{A}x=0$ works as follows: 1) Select a random set of edges $C\subseteq E$, 2) Form subgraph $G_k$ of $G$ from the selected edges, 3) For each connected component of $G_k$, replace node values with their average.
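The three steps of Theorem II.3 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the union-find helper, the uniform sampling of a block of two edges and the 5-node cycle are assumptions made for the example, not choices prescribed by [20].

```python
import random


def connected_components(n, edges):
    """Partition nodes 0..n-1 into the connected components induced by `edges`."""
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for u in range(n):
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())


def rbk_gossip_step(x, selected_edges):
    """One RBK gossip step: average the values inside each connected component.

    Nodes that touch no selected edge form singleton components and keep
    their value.
    """
    new_x = list(x)
    for comp in connected_components(len(x), selected_edges):
        avg = sum(x[u] for u in comp) / len(comp)
        for u in comp:
            new_x[u] = avg
    return new_x


rng = random.Random(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]    # cycle on 5 nodes
x = [1.0, 2.0, 3.0, 4.0, 5.0]
for _ in range(200):
    # step 1: pick a random block of two edges (assumed sampling scheme)
    x = rbk_gossip_step(x, rng.sample(edges, 2))
```

Each step leaves the sum of the values unchanged, so the only consensus point the iterates can approach is the average of the initial values.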
II-B The Heavy Ball momentum
A detailed study of several (equivalent) stochastic reformulations of consistent linear systems was developed in [30]. This new viewpoint facilitated the development and analysis of relaxed variants (with relaxation parameter $\omega$) of the sketch-and-project update (2). In particular, one of the reformulations is the stochastic optimization problem
(7) $\min_{x\in\mathbb{R}^n} f(x) := \mathbb{E}_{\mathbf{S}\sim\mathcal{D}}\left[f_{\mathbf{S}}(x)\right],$
where
(8) $f_{\mathbf{S}}(x) := \tfrac{1}{2}\|\mathbf{A}x-b\|_{\mathbf{H}}^2 = \tfrac{1}{2}(\mathbf{A}x-b)^\top\mathbf{H}(\mathbf{A}x-b),$
and $\mathbf{H}$ is the random symmetric positive semidefinite matrix defined in (1).
Under certain (weak) conditions on $\mathcal{D}$, the set of minimizers of $f$ is identical to the set of the solutions of the linear system. In [30], problem (7) was solved via Stochastic Gradient Descent (SGD):
(9) $x^{k+1} = x^k-\omega\nabla f_{\mathbf{S}_k}(x^k),$
and a linear rate of convergence was proved despite the fact that $f$ is not necessarily strongly convex and that a fixed stepsize $\omega>0$ is used. Observe that the gradient of the stochastic function (8) is given by
(10) $\nabla f_{\mathbf{S}}(x) = \mathbf{A}^\top\mathbf{H}(\mathbf{A}x-b),$
and as a result, it is easy to see that for $\omega=1$, the SGD update (9) reduces to the sketch-and-project update (2).
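As a small worked check of this last claim, substituting the stochastic gradient (10) into the SGD step (9) gives the chain below. It is a sketch of the reasoning in the notation of (1), with $\mathbf{H}_k$ denoting $\mathbf{H}$ formed from the sample $\mathbf{S}_k$, rather than a quotation from [30].

```latex
\begin{align*}
x^{k+1} &= x^k - \omega \nabla f_{\mathbf{S}_k}(x^k) \\
        &= x^k - \omega\, \mathbf{A}^\top \mathbf{H}_k (\mathbf{A}x^k - b) \\
        &= x^k - \mathbf{A}^\top \mathbf{S}_k
           (\mathbf{S}_k^\top \mathbf{A}\mathbf{A}^\top \mathbf{S}_k)^\dagger
           \mathbf{S}_k^\top (\mathbf{A}x^k - b)
           \qquad \text{for } \omega = 1 .
\end{align*}
```

For $\omega\neq 1$ the middle line is the relaxed variant of sketch-and-project: the iterate moves a fraction $\omega$ of the way towards the projection onto the sketched system.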
The recent works [21, 22] analyze momentum variants of SGD, with the goal to accelerate the convergence of the method for solving problem (7). SGD with momentum—also known as the stochastic heavy ball method (SHB)—is a well known algorithm in the optimization literature for solving stochastic optimization problems, and it is extremely popular in areas such as deep learning [33, 34, 16, 36]. However, even though SHB is used extensively in practice, its theoretical convergence behavior is not well understood. To the best of our knowledge, [21, 22] are the first that prove linear convergence of SHB in any setting.
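The update rule of SHB for solving problem (7) is formally presented in the following algorithm:
Algorithm 1 (SHB): Choose a stepsize $\omega>0$, a momentum parameter $\beta\ge 0$ and starting points $x^0, x^1$. For $k\ge 1$, draw a fresh sample $\mathbf{S}_k\sim\mathcal{D}$ and set $x^{k+1} = x^k-\omega\nabla f_{\mathbf{S}_k}(x^k)+\beta(x^k-x^{k-1})$.
Using the expression for the stochastic gradient (10), the update rule of SHB can be written more explicitly:
(11) $x^{k+1} = x^k-\omega\mathbf{A}^\top\mathbf{H}_k(\mathbf{A}x^k-b)+\beta(x^k-x^{k-1}).$
Using the same choice of distribution $\mathcal{D}$ as in equations (3) and (4), we now obtain momentum variants of RK and RBK:

RK with momentum (mRK): $x^{k+1} = x^k-\omega\frac{\mathbf{A}_{i:}x^k-b_i}{\|\mathbf{A}_{i:}\|^2}\mathbf{A}_{i:}^\top+\beta(x^k-x^{k-1}).$

RBK with momentum (mRBK): $x^{k+1} = x^k-\omega\mathbf{A}_{C:}^\top(\mathbf{A}_{C:}\mathbf{A}_{C:}^\top)^\dagger(\mathbf{A}_{C:}x^k-b_C)+\beta(x^k-x^{k-1}).$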
In [22], two main theoretical results describing the behavior of SHB (and as a result also the special cases mRK and mRBK) were presented:
Theorem II.4 (Theorem 1, [22])
Choose . Let be the sequence of random iterates produced by SHB. Let (resp, ) be the smallest nonzero (resp. largest) eigenvalue of . Assume and and that the expressions and satisfy . Then
(12) 
and where and . Moreover, .
Theorem II.5 (Theorem 4, [22])
Let be the sequence of random iterates produced by SHB, started with , with relaxation parameter (stepsize) and momentum parameter . Let . Then there exists a constant such that for all we have
Using Theorem II.5 and by a proper combination of the stepsize $\omega$ and the momentum parameter $\beta$, SHB enjoys an accelerated linear convergence rate in mean [22].
Corollary 1
(i) If and , then the iteration complexity of SHB becomes: .
(ii) If and , then the iteration complexity of SHB becomes: .
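Before specializing to gossip, the mRK update of Section II-B can be sketched for a generic consistent system. The snippet below is a minimal illustration under stated assumptions: rows are sampled uniformly at random, the method is started from $x^0=x^1=0$, and the function name `mrk_solve`, the parameter values and the 3-by-2 test system are hypothetical choices made for the example, not taken from [22].

```python
import random


def mrk_solve(A, b, omega=1.0, beta=0.3, iters=2000, seed=0):
    """Randomized Kaczmarz with heavy ball momentum for a consistent system A x = b.

    Assumptions (illustrative only): uniform row sampling and x^0 = x^1 = 0.
    """
    rng = random.Random(seed)
    n = len(A[0])
    x_prev = [0.0] * n          # x^{k-1}
    x = [0.0] * n               # x^k
    for _ in range(iters):
        i = rng.randrange(len(A))                       # sample one row
        row = A[i]
        residual = sum(a * v for a, v in zip(row, x)) - b[i]
        scale = omega * residual / sum(a * a for a in row)
        # relaxed Kaczmarz step plus the momentum term beta * (x^k - x^{k-1})
        x_next = [v - scale * a + beta * (v - p)
                  for v, a, p in zip(x, row, x_prev)]
        x_prev, x = x, x_next
    return x


A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
b = [4.0, 7.0, -1.0]            # consistent: solved by x = (1, 2)
x = mrk_solve(A, b)
```

Setting `beta=0.0` and `omega=1.0` recovers the plain RK iteration, so the same function can be used to compare the two on a given system.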
III Randomized Gossip protocols with momentum
Having presented SHB for solving the stochastic optimization problem (7) and having described its sketch-and-project nature (11), let us now describe its behavior as a randomized gossip protocol when applied to solving the AC system $\mathbf{A}x=0$, where $\mathbf{A}\in\mathbb{R}^{m\times n}$ is the incidence matrix of the network.
Since $b=0$, method (11) can be simplified to:
(13) $x^{k+1} = \left[\mathbf{I}-\omega\mathbf{A}^\top\mathbf{H}_k\mathbf{A}\right]x^k+\beta(x^k-x^{k-1}).$
In the rest of this section we focus on two special cases of (13): RK with momentum and RBK with momentum.
III-A Randomized Kaczmarz Gossip with momentum
When RK is applied to solve an AC system $\mathbf{A}x=0$, one recovers the famous pairwise gossip algorithm [4]. Algorithm 2 describes how the relaxed variant of randomized Kaczmarz with momentum behaves as a gossip algorithm. See also Figure 1 for a graphical illustration of the method.
Remark III.1
In the special case with $\beta=0$ (zero momentum) only the two nodes of edge $e=(i,j)$ update their values. In this case the two nodes do not update their values to their exact average but to a convex combination that depends on the stepsize $\omega$. To obtain the pairwise gossip algorithm of [4], we should further choose $\omega=1$.
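The gossip form of mRK can be sketched as follows. For a single row of the incidence matrix, with entries $+1$ and $-1$ at the endpoints of edge $(i,j)$ and squared norm 2, the momentum update moves the two active nodes towards each other by $\frac{\omega}{2}(x_i^k-x_j^k)$, while every node adds its own momentum term $\beta(x_\ell^k-x_\ell^{k-1})$. The function name `mrk_gossip`, the uniform edge sampling, the initialization $x^0=x^1=c$ and the parameter values are assumptions of this sketch.

```python
import random


def mrk_gossip(c, edges, omega=1.0, beta=0.2, iters=3000, seed=0):
    """Pairwise gossip with heavy ball momentum (a sketch of mRK on A x = 0).

    Assumptions (illustrative only): edges are sampled uniformly at random
    and both registers start at the private values, x^0 = x^1 = c.
    """
    rng = random.Random(seed)
    x_prev = list(c)            # x^{k-1}
    x = list(c)                 # x^k
    for _ in range(iters):
        i, j = rng.choice(edges)
        # every node applies its own momentum term (no communication needed)
        x_next = [v + beta * (v - p) for v, p in zip(x, x_prev)]
        # only the active pair exchanges values and moves towards each other
        diff = x[i] - x[j]
        x_next[i] -= 0.5 * omega * diff
        x_next[j] += 0.5 * omega * diff
        x_prev, x = x, x_next
    return x


cycle = [(k, (k + 1) % 5) for k in range(5)]    # ring graph on 5 nodes
c = [1.0, 2.0, 3.0, 4.0, 5.0]
x = mrk_gossip(c, cycle, omega=1.0, beta=0.2, iters=3000)
```

With `beta=0.0` and `omega=1.0` a single step replaces the two active values by their exact average, which is the pairwise gossip algorithm of [4] described in Remark III.1.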
Distributed Nature of the Algorithm: Here we highlight a few ways to implement mRK in a distributed fashion.
Asynchronous pairwise broadcast gossip: In this protocol each node of the network has a clock that ticks at the times of a rate 1 Poisson process. The inter-tick times are exponentially distributed, independent across nodes, and independent across time. This is equivalent to a global clock ticking according to a rate $n$ Poisson process which wakes up an edge of the network at random. In particular, in this implementation mRK works as follows: in the $k$-th iteration (time slot) the clock of node $i$ ticks, and node $i$ randomly contacts one of its neighbors and simultaneously broadcasts a signal to inform the nodes of the whole network that it is updating (this signal does not contain any private information of node $i$). The two nodes share their information and update their private values following the update rule of Algorithm 2, while all the other nodes update their values using their own information. In each iteration only one pair of nodes exchanges private values.
Synchronous pairwise gossip: In this protocol a single global clock is available to all nodes. The time is assumed to be slotted commonly across nodes, and in each time slot only one pair of nodes of the network is randomly activated and exchanges information following the update rule of Algorithm 2. The remaining, non-activated nodes update their values using their own last two private values. Note that this implementation of mRK has the disadvantage that it requires a central entity which chooses the active pair of nodes in each step.
Asynchronous pairwise gossip with common counter: The update rule of the nodes of the active pair in Algorithm 2 can be rewritten as follows:
In particular, observe that their update rule contains the expression $x_i^k+\beta(x_i^k-x_i^{k-1})$, which is precisely the update of all non-activated nodes of the network. Thus, if we assume that the nodes share a common counter that counts how many iterations have taken place, and that each node also stores the last iterate at which it was activated, then the algorithm can work in a distributed fashion as follows:
Let us denote the number of total iterations (common counter) that becomes available to the activated nodes of each step as and let us define with the number of iterations between the current iterate and the last time that the node is picked (iteration ); then the update rule of Algorithm 2 can be equivalently expressed as:

Pick an edge at random following .

The private values of the nodes are updated as follows:
; for any other node :
III-B Connection with the accelerated gossip algorithm
In the randomized gossip literature there is one particular method closely related to our approach. It was first proposed in [5] and its analysis under strong conditions was presented in [17]. In that work local memory is exploited by installing shift registers at each agent. In particular, we are interested in the case of just two registers, where the first stores the agent’s current value and the second the agent’s value before the latest update. The algorithm can be described as follows. Suppose that edge $e=(i,j)$ is chosen at time $k$. Then,

Node :

Node :

Any other node :
where . The method was analyzed in [17] under a strong assumption on the probabilities of choosing the pair of nodes, an assumption that, as the authors mention, is unrealistic in practical scenarios and for networks like random geometric graphs. At this point we should highlight that the results presented in [22] hold for essentially any distribution $\mathcal{D}$, and as a result such a problem cannot occur.
Note also that if we choose in the update rule of Algorithm 2, then our method is simplified to:

Node :

Node :

Any other node :
In order to apply Theorem II.4, we need to assume that and which also means that . Thus for and momentum parameter it is easy to see that our approach is very similar to the shift-register algorithm. Both methods update the selected pair of nodes in the same way. However, in our case the other nodes of the network do not remain idle but instead also update their values using their own previous information.
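Using the momentum matrix $\mathbf{M}$, the two algorithms above can be expressed as:
(14) $x^{k+1} = \left[\mathbf{I}-\omega\mathbf{A}^\top\mathbf{H}_k\mathbf{A}\right]x^k+\mathbf{M}(x^k-x^{k-1}).$
In particular, in our algorithm every element on the diagonal of $\mathbf{M}$ is equal to $\beta$, while in [5] all values on the diagonal are zeros except for the two entries that correspond to the nodes $i$ and $j$ of the selected edge.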
Remark III.2
The shift register case and our algorithm can be seen as two limit cases of the update rule (14). In particular, the shift register method uses only two nonzero diagonal elements in , while our method has a full diagonal. We believe that further methods can be developed in the future by exploring the cases where more than two but not all elements of the diagonal matrix are nonzero. It might be possible to obtain better convergence if one carefully chooses these values based on the network topology. We leave this as an open problem for future research.
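The two limit cases of Remark III.2 can be put side by side in code. The sketch below assumes that the update (14) has the form $x^{k+1} = [\mathbf{I}-\omega\mathbf{A}^\top\mathbf{H}_k\mathbf{A}]x^k+\mathbf{M}(x^k-x^{k-1})$ with a diagonal $\mathbf{M}$, which is our reading of the surrounding text; the function names and the numbers are hypothetical. The diagonal of $\mathbf{M}$ is passed as a plain list, so intermediate choices (more than two but not all entries nonzero) can be tried by editing that list.

```python
def gossip_step(x, x_prev, edge, omega, momentum):
    """One pairwise gossip step with a diagonal momentum matrix.

    `momentum` holds the diagonal of M, one coefficient per node
    (assumed form of update (14); illustrative only).
    """
    i, j = edge
    # node-wise momentum: M (x^k - x^{k-1}) with diagonal M
    x_next = [v + m * (v - p) for v, p, m in zip(x, x_prev, momentum)]
    # relaxed pairwise averaging on the selected edge
    diff = x[i] - x[j]
    x_next[i] -= 0.5 * omega * diff
    x_next[j] += 0.5 * omega * diff
    return x_next


def full_diagonal(n, beta):
    """Every node carries momentum (the protocol of this paper)."""
    return [beta] * n


def active_pair_diagonal(n, edge, beta):
    """Only the two nodes of the selected edge carry momentum (shift-register style)."""
    diag = [0.0] * n
    diag[edge[0]] = diag[edge[1]] = beta
    return diag


x_prev = [0.0, 0.0, 0.0]
x = [1.0, 2.0, 4.0]
edge = (0, 1)
ours = gossip_step(x, x_prev, edge, 1.0, full_diagonal(3, 0.5))
pair_only = gossip_step(x, x_prev, edge, 1.0, active_pair_diagonal(3, edge, 0.5))
```

In this toy step the two variants agree on the active pair and differ only on the idle node, which either keeps its value or extrapolates along its own previous move.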
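III-C Randomized block Kaczmarz gossip with momentum
Recall that Theorem II.3 says how RBK (with no momentum and no relaxation) can be interpreted as a gossip algorithm. Now we use this result to explain how relaxed RBK with momentum works. Note that the update rule of RBK with momentum can be rewritten as follows:
(15) $x^{k+1} = \omega\left(\mathbf{I}-\mathbf{A}^\top\mathbf{H}_k\mathbf{A}\right)x^k+(1-\omega)x^k+\beta(x^k-x^{k-1}),$
where $\left(\mathbf{I}-\mathbf{A}^\top\mathbf{H}_k\mathbf{A}\right)x^k$ is the update rule of RBK (6).
Thus, in analogy to the simple RBK, in the $k$-th step a random set of edges is selected and $q\le n$ connected components are formed as a result. This includes the connected components that belong to the subgraph $G_k$ and also the singleton connected components (nodes outside $G_k$). Let $\mathcal{V}_r^k$ denote the set of the nodes that belong to the $r$-th connected component, $r\in\{1,\dots,q\}$, at the $k$-th step, such that $\{1,\dots,n\}=\cup_{r=1}^{q}\mathcal{V}_r^k$ and $\mathcal{V}_{r_1}^k\cap\mathcal{V}_{r_2}^k=\emptyset$ for any $r_1\neq r_2$.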
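Using the update rule (15), Algorithm 3 shows how mRBK is updating the private values of the nodes of the network (see also Figure 2 for the graphical interpretation).

For each connected component $\mathcal{V}_r^k$ of $G_k$, replace the values of its nodes with:
(16) $x_i^{k+1} = \omega\frac{\sum_{j\in\mathcal{V}_r^k}x_j^k}{|\mathcal{V}_r^k|}+(1-\omega)x_i^k+\beta(x_i^k-x_i^{k-1}) \quad \text{for all } i\in\mathcal{V}_r^k.$
Any other node $\ell$: $x_\ell^{k+1} = x_\ell^k+\beta(x_\ell^k-x_\ell^{k-1}).$
Note that in the update rule of mRBK the nodes that are not attached to a selected edge (do not belong to the subgraph $G_k$) update their values via $x_\ell^{k+1} = x_\ell^k+\beta(x_\ell^k-x_\ell^{k-1})$. By considering these nodes as singleton connected components, their update rule is exactly the same as that of the nodes of subgraph $G_k$. This is easy to see as follows:
(17) $x_\ell^{k+1} = \omega\frac{x_\ell^k}{1}+(1-\omega)x_\ell^k+\beta(x_\ell^k-x_\ell^{k-1}) = x_\ell^k+\beta(x_\ell^k-x_\ell^{k-1}).$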
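One step of the block protocol can be sketched as follows, treating untouched nodes as singleton components so that a single formula covers every node, as the text above observes. The function names, the union-find helper and the toy values are assumptions of this sketch rather than the authors' implementation.

```python
def components(n, edges):
    """Connected components (including singletons) of the graph (n, edges)."""
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for u in range(n):
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())


def mrbk_gossip_step(x, x_prev, selected_edges, omega, beta):
    """One mRBK gossip step.

    Every node in a component moves to
    omega * (component average) + (1 - omega) * own value + momentum term.
    """
    x_next = [0.0] * len(x)
    for comp in components(len(x), selected_edges):
        avg = sum(x[u] for u in comp) / len(comp)
        for u in comp:
            x_next[u] = omega * avg + (1.0 - omega) * x[u] + beta * (x[u] - x_prev[u])
    return x_next


x_prev = [0.0, 0.0, 0.0, 0.0]
x = [1.0, 2.0, 3.0, 10.0]
x_next = mrbk_gossip_step(x, x_prev, [(0, 1), (1, 2)], omega=1.0, beta=0.5)
```

In this toy step nodes 0, 1 and 2 form one component and node 3 is a singleton, so node 3 only applies its momentum term.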
Remark III.3
In the special case that only one edge is selected in each iteration ($|C|=1$), the update rule of mRBK simplifies to the update rule of mRK. In this case the subgraph $G_k$ consists of the two nodes of the selected edge.
Remark III.4
In [20] it was shown that several existing gossip protocols for solving the average consensus problem are special cases of the simple RBK (Theorem II.3). For example, two gossip algorithms that can be cast as special cases of the simple RBK are the path averaging proposed in [3] and the clique gossiping of [19]. In path averaging, in each iteration a path of nodes is selected and its nodes update their values to their exact average ($\omega=1$). In clique gossiping, the network is already divided into cliques, and through a random procedure a clique is activated and its nodes update their values to their exact average ($\omega=1$). Since mRBK contains simple RBK as a special case for $\beta=0$, we expect that these special protocols can also be accelerated with the addition of the momentum parameter $\beta$.
III-D Mass preservation
One of the key properties of some of the most efficient randomized gossip algorithms is mass preservation. If a gossip algorithm has this property, it means that the sum (and as a result the average) of the private values of the nodes remains fixed during the iterative procedure. That is, $\sum_{i=1}^{n}x_i^k=\sum_{i=1}^{n}x_i^0$ for all $k\ge 0$. The original pairwise gossip algorithm proposed in [4] satisfies the mass preservation property, while existing accelerated gossip algorithms [5, 17] preserve a scaled sum.
In this section we show that the two proposed protocols presented above also have a mass preservation property. In particular, we prove mass preservation for the case of the block randomized gossip protocol (Algorithm 3) with momentum. This is sufficient since the Kaczmarz gossip with momentum (mRK) can be cast as a special case.
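Theorem III.1
Assume that $x^0=x^1=c$, that is, the two registers of each node have the same initial value. Then for Algorithms 2 and 3 we have $\sum_{i=1}^{n}x_i^k=\sum_{i=1}^{n}c_i$ for any $k\ge 0$, and as a result $\frac{1}{n}\sum_{i=1}^{n}x_i^k=\bar{c}$.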
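Proof: We prove the result for the more general Algorithm 3. Assume that in the $k$-th step of the method $q$ connected components are formed. Let the set of the nodes of each connected component be $\mathcal{V}_r^k$, $r\in\{1,\dots,q\}$, so that $\{1,\dots,n\}=\cup_{r=1}^{q}\mathcal{V}_r^k$ and $\mathcal{V}_{r_1}^k\cap\mathcal{V}_{r_2}^k=\emptyset$ for any $r_1\neq r_2$. Thus:
(18) $\sum_{i=1}^{n}x_i^{k+1} = \sum_{r=1}^{q}\sum_{i\in\mathcal{V}_r^k}x_i^{k+1}.$
Let us first focus, without loss of generality, on connected component $r$ and simplify the expression for the sum of its nodes. Each node $i\in\mathcal{V}_r^k$ updates to $\omega$ times the component average plus $(1-\omega)x_i^k+\beta(x_i^k-x_i^{k-1})$, so $\sum_{i\in\mathcal{V}_r^k}x_i^{k+1} = \omega\sum_{j\in\mathcal{V}_r^k}x_j^k+(1-\omega)\sum_{i\in\mathcal{V}_r^k}x_i^k+\beta\sum_{i\in\mathcal{V}_r^k}(x_i^k-x_i^{k-1}) = (1+\beta)\sum_{i\in\mathcal{V}_r^k}x_i^k-\beta\sum_{i\in\mathcal{V}_r^k}x_i^{k-1}$. By substituting this for all $r$ into the right hand side of (18) and from the fact that $\{1,\dots,n\}=\cup_{r=1}^{q}\mathcal{V}_r^k$, we get $\sum_{i=1}^{n}x_i^{k+1} = (1+\beta)\sum_{i=1}^{n}x_i^k-\beta\sum_{i=1}^{n}x_i^{k-1}$. Since $x^0=x^1$, we have $\sum_{i=1}^{n}x_i^0=\sum_{i=1}^{n}x_i^1$, and as a result $\sum_{i=1}^{n}x_i^k=\sum_{i=1}^{n}x_i^0$ for all $k\ge 0$.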
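The role of the two initial registers in the argument above can be checked numerically. The sketch below tracks the sum of the node values along a run of the pairwise protocol with momentum; the function name `run_sums`, the uniform edge sampling and the toy inputs are assumptions of the example. When the two starting vectors differ, the sum obeys the same two-term recursion as the individual momentum terms and is no longer constant.

```python
import random


def run_sums(x0, x1, edges, omega, beta, iters, seed=0):
    """Run pairwise gossip with momentum and record the sum of values per iterate."""
    rng = random.Random(seed)
    x_prev, x = list(x0), list(x1)
    sums = [sum(x_prev), sum(x)]
    for _ in range(iters):
        i, j = rng.choice(edges)                 # uniform edge sampling (assumed)
        x_next = [v + beta * (v - p) for v, p in zip(x, x_prev)]
        diff = x[i] - x[j]
        x_next[i] -= 0.5 * omega * diff          # the pairwise exchange itself
        x_next[j] += 0.5 * omega * diff          # never changes the sum
        x_prev, x = x, x_next
        sums.append(sum(x))
    return sums


path = [(0, 1), (1, 2)]
c = [1.0, 2.0, 3.0]

# equal registers, x^0 = x^1 = c: the sum stays at 6 throughout
preserved = run_sums(c, c, path, omega=1.0, beta=0.3, iters=50)

# unequal registers: s^{k+1} = (1 + beta) s^k - beta s^{k-1}, so the sum drifts
drifting = run_sums([0.0, 0.0, 0.0], c, path, omega=1.0, beta=0.5, iters=2)
```

The second run starts with sums 0 and 6 and therefore moves to 9 and then 10.5, which illustrates why the equal-registers initialization is needed for mass preservation.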
IV Numerical Evaluation
We devote this section to the experimental evaluation of the proposed gossip algorithms: mRK and mRBK. In particular, we perform three experiments. In the first two we focus on the performance of mRK, while in the last one we focus on its block variant mRBK. In comparing the methods with their momentum variants we use the relative error measure where the starting vectors of values are taken to be always Gaussian vectors. For all of our experiments the horizontal axis represents the number of iterations. The networks used in the experiments are the cycle (ring graph), the 2-dimensional grid and the random geometric graph (RGG) with radius . Code was written in Julia 0.6.3.
IV-A Impact of momentum parameter on mRK
Recall that in the simple pairwise gossip algorithm the two nodes that exchange information update their values to their exact average while all the other nodes remain idle. In our framework this method can be cast as a special case of mRK when $\omega=1$ and $\beta=0$. In this experiment we always keep the stepsize equal to $\omega=1$, which means that the pair of the chosen nodes update their values to their exact average. We show that by choosing a suitable momentum parameter $\beta$ we can have faster convergence for all networks under study. See Figure 3 for more details.
IV-B Comparison with the Shift-Register
In this experiment we compare mRK with the shift-register case when the stepsize $\omega$ and the momentum parameter $\beta$ are chosen so as to satisfy the connection established in Section III-B. That is, we choose for any choice of . Observe that in all plots of Figure 4 our algorithm outperforms the corresponding shift-register case.
IV-C Impact of momentum parameter on mRBK
In this experiment our goal is to show that the addition of momentum accelerates the RBK gossip algorithm proposed in [20]. Without loss of generality we choose the block size to be always equal to . That is, the random matrix in the update rule of mRBK is always a column submatrix of the identity matrix. Thus, in each iteration edges of the network are chosen to form the subgraph and the values of the nodes are updated according to Algorithm 3. Note that similar plots can be obtained for any choice of block size. We run all algorithms with fixed stepsize . The plots show that by choosing a suitable momentum parameter $\beta$ we have faster convergence than when $\beta=0$, for all networks under study. See Figure 5 for more details.
V Conclusion and Future research
In this paper we present new accelerated randomized gossip algorithms using tools from numerical linear algebra and the area of randomized Kaczmarz methods for solving linear systems. In particular, using recently developed results on the stochastic reformulation of consistent linear systems, we explain how the stochastic heavy ball method for solving a specific quadratic stochastic optimization problem can be interpreted as a gossip algorithm. To the best of our knowledge, it is the first time that such protocols are presented for the average consensus problem. We believe that this work opens up many possible future avenues for research. For example, using other Kaczmarz-type methods to solve particular linear systems we can obtain novel distributed protocols for average consensus. In addition, we believe that the gossip protocols presented in this work can be extended to the more general setting of distributed optimization, where the goal is to minimize the average of convex functions in a distributed fashion.
References
 [1] N. S. Aybat and M. Gürbüzbalaban. Decentralized computation of effective resistances and acceleration of consensus algorithms. In Signal and Information Processing (GlobalSIP), 2017 IEEE Global Conference on, pages 538–542. IEEE, 2017.
 [2] T.C. Aysal, M.E. Yildiz, A.D. Sarwate, and A. Scaglione. Broadcast gossip algorithms for consensus. IEEE Trans. Signal Process., 57(7):2748–2761, 2009.
 [3] F. Bénézit, A.G. Dimakis, P. Thiran, and M. Vetterli. Order-optimal consensus through randomized path averaging. IEEE Trans. Inf. Theory, 56(10):5150–5167, 2010.
 [4] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
 [5] M. Cao, D.A. Spielman, and E.M. Yeh. Accelerated gossip algorithms for distributed computation. In Proc. of the 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
 [6] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. J. Parallel Distrib. Comput., 7(2):279–301, 1989.
 [7] Morris H DeGroot. Reaching a consensus. Journal of the American Statistical Association, 69(345):118–121, 1974.
 [8] A.G. Dimakis, S. Kar, J.M.F. Moura, M.G. Rabbat, and A. Scaglione. Gossip algorithms for distributed signal processing. Proceedings of the IEEE, 98(11):1847–1864, 2010.
 [9] A.G. Dimakis, A.D. Sarwate, and M.J. Wainwright. Geographic gossip: Efficient averaging for sensor networks. IEEE Trans. Signal Process., 56(3):1205–1216, 2008.
 [10] Y.C. Eldar and D. Needell. Acceleration of randomized Kaczmarz method via the Johnson–Lindenstrauss lemma. Numerical Algorithms, 58(2):163–177, 2011.
 [11] N.M. Freris and A. Zouzias. Fast distributed smoothing of relative measurements. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 1411–1416. IEEE, 2012.
 [12] R.M. Gower and P. Richtárik. Randomized iterative methods for linear systems. SIAM. J. Matrix Anal. & Appl., 36(4):1660–1690, 2015.
 [13] R.M. Gower and P. Richtárik. Stochastic dual ascent for solving linear systems. arXiv preprint arXiv:1512.06890, 2015.
 [14] F. Hanzely, J. Konečný, N. Loizou, P. Richtárik, and D. Grishchenko. Privacy preserving randomized gossip algorithms. arXiv preprint arXiv:1706.07636, 2017.
 [15] S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l’Academie Polonaise des Sciences et des Lettres, 35:355–357, 1937.
 [16] A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 [17] J. Liu, B.D.O. Anderson, M. Cao, and A.S. Morse. Analysis of accelerated gossip algorithms. Automatica, 49(4):873–883, 2013.
 [18] J. Liu and S. Wright. An accelerated randomized Kaczmarz algorithm. Mathematics of Computation, 85(297):153–178, 2016.
 [19] Yang Liu, Bo Li, Brian Anderson, and Guodong Shi. Clique gossiping. arXiv preprint arXiv:1706.02540, 2017.
 [20] N. Loizou and P. Richtárik. A new perspective on randomized gossip algorithms. In 4th IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2016.
 [21] N. Loizou and P. Richtárik. Linearly convergent stochastic heavy ball method for minimizing generalization error. NIPS Workshop on Optimization for Machine Learning [arXiv preprint arXiv:1710.10737], 2017.
 [22] N. Loizou and P. Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.
 [23] A. Ma, D. Needell, and A. Ramdas. Convergence properties of the randomized extended Gauss-Seidel and Kaczmarz methods. SIAM Journal on Matrix Analysis and Applications, 36(4):1590–1604, 2015.
 [24] A. Nedić, A. Olshevsky, and M. G. Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
 [25] D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT Numerical Mathematics, 50(2):395–403, 2010.
 [26] D. Needell and J.A. Tropp. Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra and its Applications, 441:199–221, 2014.
 [27] D. Needell, R. Zhao, and A. Zouzias. Randomized block Kaczmarz method with projection for solving least squares. Linear Algebra and its Applications, 484:322–343, 2015.
 [28] A. Olshevsky. Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control. arXiv preprint arXiv:1411.4186, 2014.
 [29] A. Olshevsky and J.N. Tsitsiklis. Convergence speed in distributed consensus and averaging. SIAM J. Control Optim., 48(1):33–55, 2009.
 [30] P. Richtárik and M. Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108, 2017.
 [31] F. Schöpfer and D.A. Lorenz. Linear convergence of the randomized sparse Kaczmarz method. arXiv preprint arXiv:1610.02889, 2016.
 [32] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15(2):262–278, 2009.
 [33] I. Sutskever, J. Martens, G.E. Dahl, and G.E. Hinton. On the importance of initialization and momentum in deep learning. ICML (3), 28:1139–1147, 2013.
 [34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
 [35] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE transactions on automatic control, 31(9):803–812, 1986.
 [36] A.C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.
 [37] L. Xiao and S. Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.
 [38] L. Xiao, S. Boyd, and S. Lall. A scheme for robust distributed sensor fusion based on average consensus. In Information Processing in Sensor Networks, 2005. IPSN 2005. Fourth International Symposium on, pages 63–70. IEEE, 2005.
 [39] A. Zouzias and N.M. Freris. Randomized extended Kaczmarz for solving least squares. SIAM. J. Matrix Anal. & Appl., 34(2):773–793, 2013.
 [40] A. Zouzias and N.M. Freris. Randomized gossip algorithms for solving Laplacian systems. In Control Conference (ECC), 2015 European, pages 1920–1925. IEEE, 2015.