Quantized Push-Sum for Gossip and Decentralized Optimization over Directed Graphs
Abstract
We consider a decentralized stochastic learning problem where data points are distributed among computing nodes communicating over a directed graph. As the model size grows, decentralized learning faces a major bottleneck: the heavy communication load incurred by each node transmitting large messages (model updates) to its neighbors. To tackle this bottleneck, we propose a quantized decentralized stochastic learning algorithm over directed graphs that builds on the push-sum algorithm from decentralized consensus optimization. More importantly, we prove that our algorithm achieves the same convergence rates as decentralized stochastic learning with exact communication, for both convex and nonconvex losses. A key technical challenge of this work is proving exact convergence of the proposed decentralized learning algorithm over directed graphs in the presence of quantization noise with unbounded variance. We provide numerical evaluations that corroborate our main theoretical results and illustrate significant speedup compared to exact-communication methods.
1 Introduction
In modern machine learning applications we typically face optimization problems over massive data sets, which demands utilizing multiple computing agents to accelerate convergence. Furthermore, in some applications each processing agent has its own local data set, and communicating data among different workers is often impractical for reasons such as privacy and bandwidth utilization. In such settings, computing nodes rely on their own data to run a (stochastic) gradient descent algorithm, while exchanging parameters with other workers in each iteration to ensure convergence to the optimal solution of the (global) objective. More precisely, the goal of decentralized learning is to optimize a function f defined as the average of the local functions f_1, …, f_n of all n computing nodes, i.e.,
(1)   min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x).
Here, each local function f_i can be viewed as the loss incurred over the local dataset of node i:

(2)   f_i(x) = (1/m_i) Σ_{j=1}^{m_i} ℓ(x; ξ_{i,j}),

where ℓ is the loss function, m_i is the dataset size of node i, and ξ_{i,j} denotes the d-dimensional local data points. The classical decentralized setting consists of two steps. First, each computing node runs a (stochastic) gradient descent step on its local function using its local parameter and dataset. Then the local parameters are exchanged between neighboring workers to compute a weighted average of the neighboring nodes' parameters. The local parameter of every node at iteration t + 1 is thus obtained by combining the weighted average of the neighboring nodes' solutions with a descent step along the negative local gradient direction. In particular, if we define x_i^t as the decision variable of node i at step t, then its local update can be written as
(3)   x_i^{t+1} = Σ_{j=1}^n w_ij x_j^t − α g_i^t.

Here α is the step size and g_i^t is the stochastic gradient of the function f_i evaluated using a random sample of the dataset of node i. The matrix W = [w_ij] collects the weights used in the averaging step; in particular, w_ij is the weight that node i assigns to node j. It can be shown that under some conditions on the weight matrix W, the iterates of the update in (3) converge to the optimal solution when the local functions are convex [YLY16], and to a first-order stationary point in the nonconvex case [LZZ+17]. Most commonly in these settings, workers communicate over an undirected connected graph G = (V, E), and to derive these theoretical results the weight matrix W should have the following properties:

W is doubly stochastic, i.e., W1 = 1 and 1^T W = 1^T, where 1 indicates an n-dimensional vector of all 1's;

W is symmetric: W = W^T;

The spectral gap of W is strictly positive, i.e., the second largest eigenvalue of W satisfies λ_2(W) < 1.
It can be shown that these assumptions guarantee that the t-th power of W, i.e., W^t, converges to the averaging matrix (1/n) 1 1^T at a linear rate (i.e., exponentially fast); see, e.g., Lemma 5 in [LZZ+17]. This ensures consensus among the workers in estimating the optimum solution x*. For directed graphs, however, satisfying the first and second conditions is generally not possible. Over the last few years there have been several works tackling decentralized optimization over directed graphs; e.g., [BHO+05] showed that for row-stochastic matrices W with positive entries on the diagonal, W^t converges to 1π^T at a geometric rate, where π is a stochastic vector. Based on this result, [NO14] proposed the push-sum algorithm for decentralized optimization over (time-varying) directed graphs. The basic intuition is that the algorithm estimates the vector π by an auxiliary vector that is updated at every worker in each iteration.
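The linear convergence of W^t to the averaging matrix is easy to verify numerically. The sketch below uses a small, hypothetical doubly stochastic weight matrix (an illustration only, not one from the paper) and checks that its powers approach (1/n) 1 1^T:

```python
import numpy as np

# A hypothetical doubly stochastic weight matrix on 3 nodes
# (rows and columns each sum to 1; its second-largest eigenvalue is 0.25).
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

n = W.shape[0]
# The t-th power of W converges linearly to the averaging matrix (1/n) 11^T.
Wt = np.linalg.matrix_power(W, 50)
avg_matrix = np.full((n, n), 1.0 / n)
print(np.max(np.abs(Wt - avg_matrix)))  # essentially zero
```

Since the deviation decays like λ_2^t = 0.25^t, fifty iterations already drive it below machine precision.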
However, the aforementioned algorithms for decentralized settings over directed graphs require exchanging the agents' models exactly, without any error, in order to guarantee convergence to the desired solution. As the model size grows, e.g., in deep learning, the communication overhead of each iteration becomes the major bottleneck. Parameter quantization is the main approach to tackle this issue. Although quantization might increase the overall number of iterations, the reduction in communication cost leads to an efficient algorithm for optimization or gossip problems with large model sizes. In this paper, we exploit the idea of compressing the communicated signals to propose the first quantized algorithms for gossip and decentralized learning over directed graphs. A summary of our major contributions follows.

We propose the first algorithm for communication-efficient gossip-type problems and decentralized stochastic learning over directed graphs. Importantly, our algorithm guarantees convergence to the optimal solution.

We prove that our proposed method converges at the same rate as push-sum with exact communication. In particular, for gossip-type problems we show linear convergence. For stochastic learning problems with convex objectives over a directed network with n nodes, we show that the objective loss converges to the optimal value at the rate O(1/√(nT)). For nonconvex objectives we show that the squared norm of the gradient converges at the rate O(1/√(nT)), implying convergence to a stationary point at the optimal rate.

The proposed algorithms demonstrate significant speedup in communication time compared to the exact-communication methods for gossip and decentralized learning in our experiments.
Notation. We use boldface notation for vectors, and small and capital letters for scalars and matrices, respectively. A^T denotes the transpose of the matrix A. The mean of the rows of a matrix A is denoted by ā. We use [n] to denote the set of nodes {1, …, n}. ‖·‖ denotes the norm of a matrix or vector, depending on its argument. The i-th row of the matrix A is represented by a_i. The identity matrix of size n is denoted by I_n and the d-dimensional zero vector is denoted by 0. To simplify notation we represent the n-dimensional vector of all 1's as 1, where n is the number of computing nodes.
1.1 Prior Work
Gossip and decentralized optimization. Consensus problems over graphs are generally called gossip problems [SM03, XB04, JLM03]. In gossip-type problems, the goal of each node is to reach the average of the initial values of all nodes over an undirected graph. Over the last few years, numerous works have considered the convergence of decentralized optimization over undirected graphs [NO09, YLY16, SLW+15]. In particular, [NO09, YLY16] prove the convergence of decentralized algorithms for convex and Lipschitz functions, while [LZZ+17] proves the convergence of decentralized stochastic gradient descent for nonconvex smooth loss functions at the rate O(1/√(nT)). [SLW+15] proposes the EXTRA algorithm, which converges at a linear rate for strongly-convex losses using gradient steps.
Decentralized learning over directed graphs. The first study of the push-sum scheme for gossip-type problems over directed graphs appears in [KDG03]. The authors of [TLR12] extend this method to decentralized optimization problems and show the convergence of push-sum for convex loss functions. [NO14] extends these results to time-varying, uniformly strongly connected directed graphs. General algorithms achieving linear convergence in decentralized optimization, e.g., EXTRA, can be combined with the push-sum algorithm to obtain similar results for strongly-convex objectives over directed graphs. See the following works for interesting discussions on this topic: [ZY15], [XWK17, XMX+18], [XK18], [XK17], [XK15], [NOS17].
Quantized decentralized learning. [NOO+09] discusses the effect of quantization in decentralized gossip problems using quantization with constant variance; however, the resulting algorithms only converge to the average of the initial values of the agents up to some error. The first exact algorithm was proposed in [AGL+17] in the context of master-worker distributed optimization. [RTM+19b, RMH+19a] study exact convergence of decentralized optimization while nodes exchange quantized models over undirected graphs. Considering an unbiased quantizer with constant variance, they prove (sub)optimal convergence rates for convex and nonconvex objective functions. [DMR18, KSJ19] propose algorithms with exact convergence at optimal rates for decentralized stochastic learning with convex objectives over undirected graphs.
2 Network Model
We consider a directed graph G = (V, E), where V = {1, …, n} is the set of nodes and E denotes the set of directed edges of the graph. We say there is a link from node i to node j when (i, j) ∈ E. Since the graph is directed, this does not guarantee that there is also a link from j to i, i.e., that (j, i) ∈ E. The sets of in-neighbors and out-neighbors of node i are defined as

N_i^in = {j ∈ V : (j, i) ∈ E},   N_i^out = {j ∈ V : (i, j) ∈ E}.

We denote by d_i the out-degree of node i. Throughout this paper we assume that G is strongly connected.
Assumption 1 (Graph structure).
Graph of workers is strongly connected.
Additionally, we assume that the weight matrix W has nonnegative entries and that each node uses its own parameter as well as those of its in-neighbors. This implies that all vertices of the graph G have self-loops. We also assume that the weight matrix is column stochastic.
Assumption 2 (Weight matrix).
The matrix W is column stochastic, all of its entries are nonnegative, and all entries on its diagonal are positive.
Given the above assumptions, we next state a key result from [NO14, ZY15] that will be useful in our analysis.
Proposition 2.1.
Note that the column-stochastic property of the weight matrix is considerably weaker than the doubly-stochastic property. As explained in the next example, each computing node can use its own out-degree to form the corresponding column of the weight matrix. Thus the weight matrix can be constructed in the decentralized setting without any node knowing n or the structure of the graph.
Example 1.
Consider a strongly connected network of n computing nodes in which each node has a self-loop, and let node j assign the weight w_ij = 1/d_j to each of its out-neighbors i (itself included), where the out-degree d_j counts the self-loop. It is straightforward to see that W is column stochastic and all entries on its diagonal are positive. Therefore the constructed weight matrix satisfies Assumption 2.
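As a concrete illustration of Example 1, the following sketch builds the column-stochastic weight matrix for a hypothetical 4-node directed ring with self-loops; the topology is an assumption made purely for illustration:

```python
import numpy as np

# Hypothetical 4-node directed ring 0→1→2→3→0, plus a self-loop at every node.
n = 4
out_neighbors = {0: [0, 1], 1: [1, 2], 2: [2, 3], 3: [3, 0]}

# Node j fills column j on its own, using only its out-degree d_j:
# it assigns weight 1/d_j to each of its out-neighbors (itself included).
W = np.zeros((n, n))
for j, outs in out_neighbors.items():
    for i in outs:
        W[i, j] = 1.0 / len(outs)

print(W.sum(axis=0))   # every column sums to 1  -> column stochastic
print(np.diag(W))      # strictly positive diagonal, as in Assumption 2
```

Note that no node needed to know n or the global topology: each column is built locally from one node's out-neighbor list.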
3 Push-Sum for Directed Graphs
Before presenting our main contributions on quantized decentralized learning, we discuss gossip (consensus) algorithms over directed graphs. Consensus algorithms in the decentralized setting are commonly referred to as gossip algorithms. In this problem, workers exchange their parameters x_i^t at time t over a connected graph, and the goal is for every node to reach the average of the initial values of all nodes, i.e., x̄ = (1/n) Σ_{i=1}^n x_i^0, guaranteeing consensus among the workers. The gossip algorithm is based on taking the weighted average of the parameters of neighboring nodes, i.e., x_i^{t+1} = Σ_j w_ij x_j^t. [XB04] showed that for doubly stochastic weight matrices with a strictly positive spectral gap, W^t converges linearly to the averaging matrix (1/n) 1 1^T; thus x_i^t asymptotically converges to x̄, which guarantees convergence to the initial mean at a linear rate. The condition that W is column stochastic guarantees that the sum (and hence the average) of the workers' parameters is preserved in each iteration, i.e., Σ_i x_i^{t+1} = Σ_i x_i^t. On the other hand, if the weight matrix is not also row stochastic, W^t converges to π1^T, where π_i is the i-th entry of a stochastic vector π with the property Σ_i π_i = 1. The main approach to consensus over directed networks, or for non-doubly-stochastic weight matrices, is the push-sum protocol, first introduced in [KDG03]. In the push-sum algorithm, each worker additionally updates an auxiliary scalar variable y_i^t according to the rule

y_i^{t+1} = Σ_{j=1}^n w_ij y_j^t.

Note that the matrix W is column stochastic but not necessarily row stochastic; thus one can see that if the scalars are initialized with y_i^0 = 1 for all i, then

Σ_{i=1}^n y_i^t = n   for all t.

Thus, as t → ∞ we have

x_i^t → π_i Σ_j x_j^0   and   y_i^t → n π_i,

which implies that for all i:

z_i^t = x_i^t / y_i^t → (1/n) Σ_j x_j^0 = x̄.

This shows the asymptotic convergence of z_i^t to the initial mean. Since the parameters x_i^t and y_i^t are kept locally at each node, every node can form its variable z_i^t in the decentralized setting. Based on this approach, we present a communication-efficient algorithm for gossip over directed networks which uses quantization to reduce communication time (Section 4.1). Moreover, we will show exact convergence at the same rate as the push-sum algorithm without quantization.
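The push-sum recursion above can be simulated in a few lines. The sketch below uses a hypothetical column-stochastic (but not row-stochastic) weight matrix on 4 nodes, built as in Example 1, and checks that every ratio z_i = x_i / y_i approaches the initial mean:

```python
import numpy as np

# Column-stochastic weight matrix of a hypothetical strongly connected digraph
# on 4 nodes (node 0 has out-neighbors {0,1,2}; the others form a ring).
# Row sums differ from 1, so plain averaging x <- W x would NOT reach the mean.
W = np.array([[1/3, 0.0, 0.0, 0.5],
              [1/3, 0.5, 0.0, 0.0],
              [1/3, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5]])

rng = np.random.default_rng(0)
x = rng.normal(size=4)     # initial values x_i^0
target = x.mean()          # the consensus value all nodes should reach
y = np.ones(4)             # auxiliary scalars, y_i^0 = 1

for _ in range(200):
    x = W @ x              # sum of entries is preserved (column stochastic)
    y = W @ y              # sum of scalars stays equal to n
z = x / y                  # z_i -> mean of the initial values

print(np.max(np.abs(z - target)))  # essentially zero
```

The ratio correction by y is exactly what compensates for the lack of row stochasticity: x alone converges to π_i Σ_j x_j^0, not to the mean.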
3.1 Extension to decentralized optimization
As studied in [TLR12], the push-sum method for reaching consensus among nodes can be extended to decentralized convex optimization problems with some modifications. The push-sum algorithm for decentralized optimization with exact communication can be summarized in the following steps:

x_i^{t+1} = Σ_{j=1}^n w_ij x_j^t − α g_i^t,
y_i^{t+1} = Σ_{j=1}^n w_ij y_j^t,
z_i^{t+1} = x_i^{t+1} / y_i^{t+1},

where g_i^t denotes the stochastic gradient of f_i evaluated at z_i^t. Here, local gradients are computed at the scaled parameters z_i^t = x_i^t / y_i^t, while the parameters y_i^t are obtained as in the gossip push-sum method. It is shown in [TLR12] that for all nodes i and all T ≥ 1, the iterates of the optimization push-sum method satisfy

f(ẑ_i^T) − f(x*) ≤ O(1/√T),

for an appropriately diminishing step size and with ẑ_i^T denoting the weighted time average of z_i^t for t = 0, …, T. This result indicates that for column-stochastic weight matrices, the push-sum protocol reaches the optimal solution at a sublinear rate of O(1/√T). In the following section, we show that one can obtain a similar complexity bound even when the nodes exchange quantized signals (Section 4.2).
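A minimal simulation of these steps, assuming hypothetical quadratic local losses f_i(x) = (x − a_i)²/2 (chosen for illustration, since the global minimizer of their average is simply the mean of the a_i), exact local gradients, and a small constant step size:

```python
import numpy as np

# Same hypothetical column-stochastic 4-node weight matrix as before.
W = np.array([[1/3, 0.0, 0.0, 0.5],
              [1/3, 0.5, 0.0, 0.0],
              [1/3, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5]])

# Local losses f_i(x) = (x - a_i)^2 / 2, so the global optimum of
# (1/n) sum_i f_i is the mean of the a_i.
a = np.array([1.0, 2.0, 3.0, 4.0])
x_opt = a.mean()                 # = 2.5

x = np.zeros(4)
y = np.ones(4)
alpha = 0.005                    # small constant step size (illustrative)

for _ in range(10000):
    z = x / y                    # de-biased local models
    grad = z - a                 # exact local gradients f_i'(z_i)
    x = W @ x - alpha * grad     # mixing step followed by a descent step
    y = W @ y

print(np.abs(x / y - x_opt))     # every node ends up close to the optimum
```

With a constant step size the nodes converge to a small neighborhood of x*; the diminishing step sizes used in the analysis drive this residual to zero.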
4 Quantized Push-Sum for Directed Graphs
In this section, we propose two variants of the pushsum method with quantization for both gossip (consensus) and decentralized optimization over directed graphs.
4.1 Quantized push-sum for Gossip
We present a quantized gossip algorithm for the consensus problem in which nodes communicate quantized parameters over a directed graph. For this purpose we combine the push-sum protocol [TLR12] with the quantization scheme introduced in [KSJ19] in the context of decentralized gossip and optimization with symmetric, doubly stochastic weight matrices. The steps of our proposed algorithm are described in Algorithm 1. Algorithm 1 consists of two parts. First, the “Quantization” step, in which each node computes

q_i^t = Q(x_i^t − x̂_i^t),

where x̂_i^t is an auxiliary parameter stored locally at each node and updated at every iteration. Importantly, every node i communicates only q_i^t to its out-neighbors. Quantizing and communicating x_i^t − x̂_i^t instead of x_i^t is a crucial part of the algorithm, as it guarantees that the quantization noise asymptotically vanishes. The second part of the proposed algorithm is the “Averaging” step, in which every node updates its parameters x_i^t, y_i^t in parallel. The variables y_i^t and z_i^t are updated as in the push-sum algorithm. For updating x_i^t, the algorithm uses estimates of the values x_j^t of its in-neighbors, denoted by x̂_j^t. Each worker keeps track of the auxiliary parameters of its in-neighbors by updating them with the received q_j^t:

x̂_j^{t+1} = x̂_j^t + q_j^t.

Using the same initialization for the copies of x̂_j^0 kept locally at all out-neighbors of node j, one can see that x̂_j^t remains identical across all out-neighbors of node j for all iterations t. As in the push-sum protocol with exact communication, the role of y_i^t in Algorithm 1 is to scale the parameters of the nodes in order to obtain z_i^t = x_i^t / y_i^t, which is the estimate of the average of the initial values of the nodes, x̄.
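The two steps described above can be sketched as follows. This is a plausible rendering for illustration, not a verbatim transcription of Algorithm 1: the quantizer is a simple deterministic rounding stand-in (the unbiased quantizers of Assumption 3 are discussed in Section 5), and the combination of x_i with the shared estimates is written in a form that preserves the total mass Σ_i x_i^t:

```python
import numpy as np

def quantize(v, step=1e-3):
    # Stand-in quantizer: deterministic rounding to a grid of width `step`.
    # Only this (cheap-to-encode) residual is ever communicated.
    return np.round(v / step) * step

# Hypothetical column-stochastic weight matrix of a strongly connected digraph.
W = np.array([[1/3, 0.0, 0.0, 0.5],
              [1/3, 0.5, 0.0, 0.0],
              [1/3, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5]])

rng = np.random.default_rng(1)
x = rng.normal(size=4)           # initial values
target = x.mean()
y = np.ones(4)
x_hat = np.zeros(4)              # shared estimates of the x_i (same everywhere)

for _ in range(300):
    q = quantize(x - x_hat)      # each node sends only the quantized residual
    x_hat = x_hat + q            # all out-neighbors apply the same update
    x = x + W @ x_hat - x_hat    # averaging on estimates; total mass preserved
    y = W @ y                    # the scalars y_i are communicated exactly

z = x / y
print(np.max(np.abs(z - target)))   # small: consensus up to quantization error
```

Because the residual x_i − x̂_i shrinks as consensus is approached, the communicated messages become easy to quantize, which is the intuition behind the vanishing quantization noise.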
4.2 Quantized push-sum for decentralized optimization
Using the push-sum method for optimization problems, we propose Algorithm 2 for communication-efficient collaborative optimization over directed graphs. Like Algorithm 1, this method has Quantization and Averaging steps, with the difference that the update rule for x_i^t also incorporates one iteration of stochastic gradient descent. Importantly, we note that the stochastic gradients are evaluated at the scaled values z_i^t = x_i^t / y_i^t. One can observe that, as in Algorithm 1, the role of y_i^t here is to correct the parameters through scaling by 1/y_i^t.

In Section 5, we show that the locally kept parameters reach consensus at a sublinear rate for an appropriately chosen step size. Furthermore, with the same step size, the time average of the parameters z_i^t converges to the optimal solution for convex losses, and to a stationary point for nonconvex losses, at the same rate as Decentralized Stochastic Gradient Descent (DSGD) with exact communication. This reveals that quantization and the structure of the graph (e.g., directed or undirected) have no effect on the dominant rate of convergence under the proposed algorithm; these dependencies appear only in the higher-order terms of the upper bound.
5 Convergence Analysis
In this section, we study the convergence properties of our proposed schemes for quantized gossip and decentralized stochastic learning with convex and nonconvex objectives. To this end, we first state the conditions we assume on the quantization scheme and the loss functions.
Assumption 3 (Quantization Scheme).
The quantization function Q satisfies, for all x ∈ R^d:

(6)   E‖Q(x) − x‖² ≤ ω‖x‖²   and   E[Q(x)] = x,

where ω ∈ [0, 1).
In the following, we mention a specific quantization scheme and formally state its parameter ω.
Example 2 (Lowprecision Quantizer).
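A standard member of the low-precision family is a QSGD-style unbiased stochastic quantizer with s levels per entry. The sketch below is one common implementation of that family; the parameter choices are illustrative and the details of the scheme intended in Example 2 may differ:

```python
import numpy as np

def lowprec_quantize(x, s, rng):
    """Unbiased stochastic quantizer with s levels per entry (QSGD-style).

    Each entry is mapped to one of the grid points k * ||x|| / s, rounding
    up or down at random so that the quantizer is unbiased: E[Q(x)] = x.
    """
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    level = np.abs(x) / norm * s          # position in [0, s]
    lower = np.floor(level)
    prob = level - lower                  # round up with this probability
    xi = lower + (rng.random(x.shape) < prob)
    return norm * np.sign(x) * xi / s

rng = np.random.default_rng(0)
x = np.array([0.3, -1.2, 0.7])
# Empirical check of unbiasedness: the sample mean of many quantizations of
# the same vector should be very close to the vector itself.
samples = np.mean([lowprec_quantize(x, s=4, rng=rng) for _ in range(20000)],
                  axis=0)
print(samples)   # close to x
```

Only the sign, the per-entry integer level in {0, …, s}, and the single scalar ‖x‖ need to be transmitted, which is where the communication savings come from; the variance of this quantizer scales with ‖x‖², matching the ω‖x‖² bound in Assumption 3.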
For convenience we assume that the parameters x_i and x̂_i of all nodes are initialized with zero vectors. This assumption is without loss of generality and is made to simplify the resulting convergence bounds.
Assumption 4 (Initialization).
The parameters are initialized as x_i^0 = 0 and x̂_i^0 = 0 for all i ∈ [n].
Additionally we make the following assumptions on the local objective function of each computing node and its stochastic gradient.
Assumption 5 (Lipschitz Local Gradients).
The local functions f_i, i ∈ [n], have L-Lipschitz gradients, i.e., for all x, y ∈ R^d,

‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖.
Assumption 6 (Bounded Stochastic Gradients).
The local stochastic gradients g_i have bounded second moment, i.e., for all x ∈ R^d and i ∈ [n],

E‖g_i(x)‖² ≤ G².
Assumption 7 (Bounded Variance).
The local stochastic gradients g_i have bounded variance, i.e., for all x ∈ R^d and i ∈ [n],

E‖g_i(x) − ∇f_i(x)‖² ≤ σ².
First, we show that in Algorithm 1 the parameters z_i^t of all nodes recover the exact value of the initial mean at a linear rate.
Theorem 5.1 (Gossip).
This bound shows the effect of the parameters related to the structure of the graph and the weight matrix, i.e., δ and λ, and of the quantization parameter ω. In particular, ω emerges in the coefficient of the dominant term, and choosing ω = 0, which corresponds to gossip with exact communication, recovers the linear convergence of exact push-sum.
Next, we show that the quantized method for decentralized stochastic learning over directed graphs described in Algorithm 2 converges to the optimal solution at the optimal rate for convex objectives. In particular, we show that the global objective evaluated at the time average of the iterates z_i^t converges to the optimal value after T iterations at the rate O(1/√(nT)). The next theorem characterizes the convergence bound for Algorithm 2 with convex objectives; for compactness, its statement uses constants that collect the problem parameters.
Theorem 5.2 (Convex Objectives).
Remark 1.
In the proof of the theorem we show (see (9)) that for a fixed arbitrary step size the error decays at the stated rate; the inequality in the statement of the theorem follows from the specified choice of the step size. More importantly, we highlight that the dominant term in the bound of Theorem 5.2 decreases in both the number of workers and the minibatch size, which emphasizes their role in accelerating convergence. Additionally, the parameters related to the structure of the graph, i.e., δ and λ, and the quantization parameter ω appear only in higher-order terms that are asymptotically negligible compared to the dominant term.
In the next theorem we show the convergence of Algorithm 2 for nonconvex objectives. Importantly, we demonstrate that the gradient of the global function converges to the zero vector at the same optimal rate as in decentralized optimization with exact communication over undirected graphs (see [LZZ+17]).
Theorem 5.3 (Nonconvex Objectives).
Moreover for all and , it holds that
(10) 
Remark 2.
Inequality (5.3) implies convergence of the average of the iterates among the workers to a stationary point, as well as convergence of the average of the local gradients to zero, at the optimal rate for nonconvex objectives. Interestingly, similar to the convex case discussed in Remark 1, the number of workers and the stochastic gradient variance appear in the dominant terms, while the parameters related to the weight matrix, the graph structure, and the quantization appear only in higher-order terms.
Remark 3.
As the proof of the theorem shows, for an arbitrary fixed step size we derive the inequality in (10), and the stated result follows from the specified choice of the step size. Importantly, we note that the condition on the number of iterations T is a direct consequence of this choice. One can therefore obtain the same convergence rate for all T with other choices of the step size; for example, using relation (10) with a suitably modified step size, convergence holds for all T.
Remark 4.
Based on (10), the parameters of the nodes reach consensus at the stated rate for the specified value of the step size. For an arbitrary step size, consensus is achieved at a rate that improves as the step size decreases (see (28)), which implies that smaller step sizes result in faster consensus among the nodes. However, due to the step-size-dependent term in the convergence of the objective to the optimal solution (or to a stationary point), this faster consensus comes at the cost of slower convergence of the objective in both the convex and nonconvex cases.
6 Numerical Experiments
In this section, we compare the proposed methods for communication-efficient message passing over directed graphs with the push-sum protocol using exact communication (e.g., as formulated in [KDG03, TLR12] for gossip or optimization problems). Throughout the experiments we use two strongly connected directed graphs, illustrated in Figure 1. For each graph we form the column-stochastic weight matrix according to the method described in Example 1. In order to study the effect of the graph structure, the second graph is denser, with more connections between nodes. For quantization, we use the method discussed in Example 2, with a fixed number of bits used for each entry of the vector (one bit allocated for the sign). The norm of the transmitted vector and the scalars y_i are transmitted without quantization. In the push-sum protocol with exact communication, the entries of the vector x_i and the scalar y_i are each transmitted with 54 bits.
Gossip experiments. First, we evaluate the performance of Algorithm 1 for gossip-type problems. We initialize the parameters of all nodes as uniformly distributed random variables, and initialize the auxiliary parameters as x̂_i^0 = 0 and y_i^0 = 1 for all i. In Figure 4 (Left) we compare the performance of Algorithm 1 with the exact-communication push-sum protocol on both networks. While Algorithm 1 has almost the same per-iteration performance as push-sum on one of the graphs, it is outperformed by push-sum on the other.
However, the per-iteration superiority of exact-communication methods is to be expected. In order to compare the two methods by the time spent to reach a given error level, we compare their performance in terms of the number of bits each worker communicates. In Figure 4 (Right) the number of bits required to reach a certain error is illustrated for both methods. For the two graphs we observe up to 10x and 6x reductions in the total number of bits, respectively.
Decentralized optimization experiments. Next, we study the performance of Algorithm 2 for decentralized stochastic optimization with convex and nonconvex objective functions. First, we consider a least-squares objective, where each node has access to its own local dataset and uses one sample, drawn at random in each iteration, to obtain the stochastic gradient of its local function. The dataset is generated according to a linear model around a fixed vector. The step size for each setting is fine-tuned over a grid of candidate values. The loss at iteration t is measured by the distance of the time average of each worker's model from the optimal solution.
The results of this experiment are shown in Figure 7 (Left), which illustrates the convergence of Algorithm 2 as a function of the number of iterations, for different levels of quantization and over the two graphs. The non-quantized method performs best per iteration. This is due to the quantization noise injected into the flow of information over the graph, which depends on the number of bits each node uses for encoding and on the structure of the graph. However, this error asymptotically vanishes, resulting in a small overall quantization noise; with less quantization noise (i.e., using more bits to encode), the loss at a given iteration is smaller. On the other hand, as we observe in Figure 7 (Right), more quantization levels result in a larger number of bits needed to achieve a given loss. Consequently, the push-sum protocol with exact communication is not communication-efficient for optimization over directed graphs: using a smaller number of bits with Algorithm 2 results in a severalfold reduction in transmitted bits.
As shown in Theorem 5.3 in Section 5, our proposed method guarantees convergence to a stationary point for nonconvex smooth objective functions. In order to illustrate this, we train a neural network with sigmoid-activated hidden units to classify the MNIST dataset into its classes. We use a directed graph in which each node has access to a subset of the dataset and uses a randomly selected minibatch for computing its local stochastic gradient. For each setting, the step size is fine-tuned over a grid of candidate values. Figure 10 illustrates the training loss of the two aforementioned methods as a function of the number of iterations (Left) and of the total number of bits communicated between two neighboring nodes (Right). We note the small per-iteration gap between the loss of our proposed method with quantization and that of push-sum with exact communication. However, since our method uses significantly fewer bits per iteration, it reaches the same training loss with far fewer communicated bits. In particular, Figure 10 (Right) demonstrates a severalfold reduction in the total number of bits communicated using our proposed method.
7 Conclusion and Future Work
In this paper, we proposed a scheme for communication-efficient decentralized learning over directed graphs. We showed that our method converges at the same rate as non-quantized methods for both gossip and decentralized optimization problems. As demonstrated in Section 6, the proposed approach yields significant improvements in the communication time of the push-sum protocol. An interesting future direction is extending these results to algorithms that achieve linear convergence for strongly-convex problems (e.g., [XK17]). Another direction is extending our results to asynchronous decentralized optimization over directed graphs, where, unlike our setting, the parameters of the nodes are received and updated with delays.
Appendix
In this section, we first provide further numerical experiments and present the hyperparameters of all methods used in the experiments. We then present the proofs of Theorems 5.1-5.3.
7.1 Neural Network Training on CIFAR10 dataset
We train a neural network with 20 hidden units and sigmoid activation functions for binary classification on the CIFAR10 dataset. We use the graph topology shown in Figure 1, with the corresponding weight matrix constructed according to Example 1. The step size is fine-tuned among 15 uniformly selected values, up to iteration 300, for both algorithms.
Figure 11 illustrates the performance of the proposed algorithm as a function of the iteration number (Left) and the number of bits communicated between two neighboring nodes (Right), comparing the vanilla push-sum method with the proposed communication-efficient method using 8 bits for quantization. We use SGD with a minibatch size of 100 samples per node in both methods. We highlight the close similarity in the training loss of the two methods at any fixed iteration. More importantly, the proposed method uses a remarkably smaller number of bits, implying faster communication, while reaching the same level of training loss.
7.2 Details of the Numerical Experiments
The step sizes of the algorithms used in Section 6 are fine-tuned so that the best error achieved by each method is compared. In Table 1 we present the fine-tuned step sizes as well as the model size (i.e., the dimension of the parameters) and the minibatch size of the algorithms used throughout the numerical experiments.
Objective | iterations | model size | minibatch size | step size
Square loss | 50 | 256 | 1 | 1.7
Square loss | 50 | 256 | 1 | 1.1
NN, MNIST | 200 | 7960 | 10 | 2.2
NN, CIFAR10 | 300 | 20542 | 100 | 1.1
Square loss (4 bits) | 50 | 256 | 1 | 1.1
Square loss (16 bits) | 50 | 256 | 1 | 1.1
NN, MNIST (8 bits) | 200 | 7960 | 10 | 1.9
NN, CIFAR10 (8 bits) | 300 | 20542 | 100 | 0.3
Proofs
Notation
Throughout this section we set the following notation. For variables, stochastic gradients and gradients, we concatenate the row vectors corresponding to each node to form the following matrices:
8 Proof of Theorem 5.1: Quantized Gossip over Directed Graphs
First we write the iterations of Algorithm 1 in matrix notation to derive the following:
(11) 
Based on this, we can rewrite the update rule as follows:
By repeating this recursion over t, the update rule takes the following form:
(12) 
Multiplying both sides by (1/n)1^T, and recalling that 1^T W = 1^T, yields that
(13) 
With this and (12), we have for all t:
(14) 
Furthermore, by the iterations of Algorithm 1 in (11), as well as the assumption on the quantization noise in Assumption 3, we find that
Next, we add and subtract the corresponding term on the RHS, and also use the fact established in [ZY15], to conclude that
(15) 
With these definitions, we conclude that
Introducing shorthand for the corresponding error quantities, we derive the next two inequalities based on (14) and (15):
(16) 
Lemma 8.1.
The iterates of the algorithm satisfy, for all iterations t,
where for the values of chosen such that .