# Quantized Push-sum for Gossip and Decentralized Optimization over Directed Graphs

## Abstract

We consider a decentralized stochastic learning problem where data points are distributed among computing nodes communicating over a directed graph. As the model size gets large, decentralized learning faces a major bottleneck: the heavy communication load due to each node transmitting large messages (model updates) to its neighbors. To tackle this bottleneck, we propose a quantized decentralized stochastic learning algorithm over directed graphs that is based on the push-sum algorithm in decentralized consensus optimization. More importantly, we prove that our algorithm achieves the same convergence rates as the decentralized stochastic learning algorithm with exact communication for both convex and non-convex losses. A key technical challenge of the work is to prove exact convergence of the proposed decentralized learning algorithm in the presence of quantization noise with unbounded variance over directed graphs. We provide numerical evaluations that corroborate our main theoretical results and illustrate significant speed-up compared to the exact-communication methods.

## 1 Introduction

In modern machine learning applications we typically confront optimization problems with massive data sets, which demands utilizing multiple computing agents to accelerate convergence. Furthermore, in some applications each processing agent has its own local data set. However, communicating data among different workers is often impractical from multiple aspects such as privacy and bandwidth utilization. In such settings, computing nodes rely on their own data to run the (stochastic) gradient descent algorithm while exchanging parameters with other workers in each iteration to ensure convergence to the optimal solution of the (global) objective. More precisely, the goal of decentralized learning is to optimize a function $f$ defined as the average of the functions $f_i$, $i \in [n]$, of all computing nodes, i.e.,

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x). \tag{1}$$

Here, each local function $f_i$ can be considered as the loss incurred over the local data-set of node $i$:

$$f_i(x) := \frac{1}{m_i} \sum_{j=1}^{m_i} \ell(x, \zeta^i_j), \tag{2}$$

where $\ell$ is the loss function, $m_i$ is the data-set size of node $i$, and $\zeta^i_1, \dots, \zeta^i_{m_i}$ denote the local data points of node $i$. The classical decentralized setting consists of two steps. First, each computing node runs the (stochastic) gradient descent algorithm on its local function using its local parameter and data-set. Then the local parameters are exchanged between neighboring workers to compute a weighted average of the neighboring nodes' parameters. The local parameter of every node at iteration $t+1$ is then obtained by combining the weighted average of the neighboring nodes' solutions with a descent step along the negative local gradient direction. In particular, if we define $x_i(t)$ as the decision variable of node $i$ at step $t$, then its local update can be written as

$$x_i(t+1) = \sum_{j=1}^{n} a_{ij} x_j(t) - \alpha(t) \nabla F_i(x_i(t), \zeta_{i,t}). \tag{3}$$

Here $\alpha(t)$ is the stepsize and $\nabla F_i(x_i(t), \zeta_{i,t})$ is the stochastic gradient of function $f_i$ evaluated using a random sample $\zeta_{i,t}$ of the data-set of node $i$. Also, the matrix $A = [a_{ij}]$ represents the weights used in the averaging step, and in particular $a_{ij}$ is the weight that node $i$ assigns to node $j$. It can be shown that under some conditions on the weight matrix $A$, the iterates of the update in (3) converge to the optimal solution when local functions are convex [YLY16] and to a first-order stationary point in the non-convex case [LZZ+17]. Most commonly in these settings, workers communicate over an undirected connected graph $G = (V, E)$, and to derive these theoretical results the weight matrix $A$ should have the following properties:

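1. $A$ is doubly stochastic, i.e., $A\mathbf{1} = A^T\mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ indicates an $n$-dimensional vector of all $1$'s;

2. $A$ is symmetric: $A^T = A$;

3. The spectral gap of $A$ is strictly positive, i.e., the second largest eigenvalue of $A$ satisfies $|\lambda_2(A)| < 1$.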

It can be shown that these assumptions guarantee that the $t$-th power of $A$, i.e., $A^t$, converges to the matrix $\frac{1}{n}\mathbf{1}\mathbf{1}^T$ at a linear rate (i.e., exponentially fast); see, e.g., Lemma 5 in [LZZ+17]. This ensures consensus among different workers in estimating the optimum solution $x^\star$. However, for directed graphs, satisfying the first and second constraints is not generally possible. Over the last few years there have been several works tackling decentralized optimization over directed graphs; e.g., [BHO+05] showed that for row-stochastic matrices with positive entries on the diagonal, the matrix $A^t$ converges to $\mathbf{1}\phi^T$ at a geometric rate, where $\phi$ is a stochastic vector. Based on this result, [NO14] proposed the push-sum algorithm for decentralized optimization over (time-varying) directed graphs. The basic intuition is that the algorithm estimates the vector $\phi$ by a vector $y(t)$ which is updated among all workers in each iteration.

However, the mentioned algorithms for decentralized settings over directed graphs require exchanging the agents' models exactly and without any error in order to guarantee convergence to a desired solution. As the model size gets large, e.g., in deep learning, the communication overhead of each iteration becomes the major bottleneck. Parameter quantization is the major approach to tackle this issue. Although this approach might increase the overall number of iterations, the reduction in the communication cost leads to an efficient algorithm for optimization or gossip problems with large model size. In this paper, we exploit the idea of compressing signals for communication to propose the first quantized algorithms for gossip and decentralized learning over directed graphs. A summary of our major contributions follows.

• We propose the first algorithm for communication-efficient gossip type problems and decentralized stochastic learning over directed graphs. Importantly, our algorithm guarantees converging to the optimal solution.

• We prove that our proposed method converges at the same rate as push-sum with exact communication. In particular, for gossip type problems we show convergence at a linear (geometric) rate. For stochastic learning problems with convex objectives over a directed network with $n$ nodes, we show that after $T$ iterations the objective loss converges to the optimal value at the rate $O(1/\sqrt{nT})$. For non-convex objectives we show that the squared norm of the gradient converges at the rate $O(1/\sqrt{nT})$, suggesting convergence to a stationary point with the optimal rate.

• The proposed algorithms demonstrate significant speed-up for communication time compared to the exact-communication method for gossip and decentralized learning in our experiments.

Notation. We use boldface notation for vectors and small and large letters for scalars and matrices respectively. $A^T$ denotes the transpose of the matrix $A$. The mean of the rows of a matrix $X$ is denoted by $\bar{X}$. We use $[n]$ to denote the set of nodes $\{1, \dots, n\}$. $\|\cdot\|$ denotes the norm of a matrix or vector depending on its argument. The $i$th row of the matrix $X$ is represented by $x_i$. The identity matrix is denoted by $I$ and the zero vector is denoted by $\mathbf{0}$. To simplify notation we represent the $n$-dimensional vector of all $1$'s as $\mathbf{1}$, where $n$ is the number of computing nodes.

### 1.1 Prior Work

Gossip and decentralized optimization. Consensus problems over graphs are generally called gossip problems [SM03, XB04, JLM03]. In gossip type problems, the goal of each node is to reach the average of the initial values of all nodes over an undirected graph. Over the last few years there have been numerous works that consider the convergence of decentralized optimization over undirected graphs [NO09, YLY16, SLW+15]. In particular, [NO09, YLY16] prove the convergence of decentralized algorithms for convex and Lipschitz functions, while [LZZ+17] prove the convergence of stochastic gradient descent for non-convex and smooth loss functions with the rate $O(1/\sqrt{nT})$. [SLW+15] propose the EXTRA algorithm, which uses gradient descent and converges at a linear rate for strongly-convex losses.

Decentralized learning over directed graphs. The push-sum scheme for gossip type problems in directed graphs was first studied in [KDG03]. The authors of [TLR12] extend this method to decentralized optimization problems and show the convergence of push-sum for convex loss functions. [NO14] extend these results to time-varying directed uniformly strongly connected graphs. General algorithms for reaching a linear convergence rate in decentralized optimization, e.g., EXTRA, can be combined with the push-sum algorithm to obtain similar results for strongly-convex objective functions in directed graphs. See the following works for interesting discussions regarding this topic: [ZY15], [XWK17, XMX+18], [XK18], [XK17], [XK15], [NOS17].

Quantized decentralized learning. [NOO+09] discuss the effect of quantization in decentralized gossip problems using quantization with constant variance. However, they show that these algorithms converge to the average of the initial values of the agents only within some error. The first exact algorithm was proposed in [AGL+17] in the context of master-worker distributed optimization. [RTM+19b, RMH+19a] study exact convergence of decentralized optimization while nodes exchange quantized models over undirected graphs. Considering an unbiased quantizer with constant variance, they prove (sub-) optimal convergence rates for convex and non-convex objective functions. [DMR18, KSJ19] propose algorithms for exact convergence with optimal rates for decentralized stochastic learning with convex objectives over undirected graphs.

## 2 Network Model

We consider a directed graph $G = (V, E)$, where $V$ is the set of nodes and $E$ denotes the set of directed edges of the graph. We say there is a link from node $i$ to node $j$ when $(i, j) \in E$. As the graph is directed, this does not guarantee that there is also a link from $j$ to $i$, i.e., $(j, i) \in E$. The sets of in-neighbors and out-neighbors of node $i$ are defined as:

$$N_i^{\mathrm{in}} := \{j : (j, i) \in E\} \cup \{i\}, \qquad N_i^{\mathrm{out}} := \{j : (i, j) \in E\} \cup \{i\}.$$

We denote $d_i^{\mathrm{out}} := |N_i^{\mathrm{out}}|$ to be the out-degree of node $i$. Throughout this paper we assume that $G$ is strongly connected.

###### Assumption 1 (Graph structure).

Graph $G$ of workers is strongly connected.

Additionally we assume that the weight matrix $A$ has non-negative entries and each node uses its own parameter as well as those of its in-neighbors. This implies that all vertices of graph $G$ have self-loops. Also we assume that the weight matrix is column stochastic.

###### Assumption 2 (Weight matrix).

Matrix $A$ is column stochastic, all entries are non-negative and all entries on the diagonal are positive.

Given the above assumptions, we next state a key result from [NO14, ZY15] that will be useful in our analysis.

###### Proposition 2.1.

Let Assumptions 1 and 2 hold and let $A$ be the corresponding weight matrix of workers in a graph $G$. Then, there exist a stochastic vector $\phi \in \mathbb{R}^n$, and constants $0 < \lambda < 1$ and $C > 0$ such that for all $t \geq 0$:

$$\left\| A^t - \phi\mathbf{1}^T \right\| \leq C\lambda^t. \tag{4}$$

Moreover there exists a constant $\delta > 0$ such that for all $i \in [n]$ and $t \geq 1$

$$[A^t \mathbf{1}]_i \geq \delta. \tag{5}$$

Note that the column-stochastic property of the weight matrix is considerably weaker than the doubly-stochastic property. As explained in the next example, each computing node $i$ can use its own out-degree to form the $i$'th column of the weight matrix. Thus the weight matrix can be constructed in the decentralized setting without each node knowing $n$ or the structure of the graph.

###### Example 1.

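Consider a strongly connected network of $n$ computing nodes where $a_{ij} = 1/d_j^{\mathrm{out}}$ for all $i \in N_j^{\mathrm{out}}$ (and $a_{ij} = 0$ otherwise), with $d_j^{\mathrm{out}} = |N_j^{\mathrm{out}}|$ the out-degree of node $j$ counting its self-loop, and each node has a self-loop. It is straightforward to see that $A$ is column stochastic and all entries on the diagonal are positive. Therefore the constructed weight matrix satisfies Assumption 2.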
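The construction in Example 1 can be sketched in a few lines. This is a minimal illustration, assuming the rule that node $j$ splits its weight evenly over its out-neighbors and itself (that is, column $j$ holds $1/d_j^{\mathrm{out}}$ on those rows); the function name `column_stochastic_weights` and the 4-node graph are hypothetical and not taken from the paper.

```python
def column_stochastic_weights(out_neighbors):
    """Build A with a_ij = 1/d_j_out for every receiver i of node j, else 0.

    out_neighbors[j] lists the nodes that j sends to (self-loop added here).
    """
    n = len(out_neighbors)
    A = [[0.0] * n for _ in range(n)]
    for j, targets in enumerate(out_neighbors):
        receivers = set(targets) | {j}       # every node keeps a self-loop
        for i in receivers:
            A[i][j] = 1.0 / len(receivers)   # node j only needs its own out-degree
    return A


# Hypothetical strongly connected digraph: 0->1, 1->2, 2->0, 2->3, 3->0
out_neighbors = [[1], [2], [0, 3], [0]]
A = column_stochastic_weights(out_neighbors)

n = len(A)
column_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
row_sums = [sum(row) for row in A]
print("column sums:", column_sums)  # each column sums to one
print("row sums:   ", row_sums)     # rows generally do not
```

Each column is filled using only the sender's local information, which is why no node needs to know $n$ or the global topology. The row sums differ from one in this example, so the matrix is column stochastic without being doubly stochastic.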

## 3 Push-sum for Directed Graphs

Before explaining our main contributions on quantized decentralized learning, we discuss gossip or consensus algorithms over directed graphs. Consensus algorithms in the decentralized setting are referred to as gossip algorithms. In this problem, workers exchange their parameters $x_i(t)$ at time $t$ over a connected graph. The goal is to reach the average of the initial values of all nodes, i.e., $\frac{1}{n}\sum_{i=1}^{n} x_i(1)$, at every node, guaranteeing consensus among workers. The gossip algorithm is based on the weighted average of the parameters of neighboring nodes, i.e., $x_i(t+1) = \sum_{j=1}^{n} a_{ij} x_j(t)$. [XB04] showed that for doubly stochastic weight matrices whose second largest eigenvalue is smaller than one in magnitude, the powers of the weight matrix converge at a linear rate to the average matrix; thus, $X(t)$ asymptotically converges to $\frac{1}{n}\mathbf{1}\mathbf{1}^T X(1)$, which guarantees convergence to the initial mean at a linear rate. The condition on $A$ being column-stochastic guarantees that the average of workers is preserved in each iteration, i.e., $\frac{1}{n}\mathbf{1}^T X(t+1) = \frac{1}{n}\mathbf{1}^T X(t)$. On the other hand, if the weight matrix is not row-stochastic, $x_i(t)$ converges to $\phi_i \mathbf{1}^T X(1)$, where $\phi_i$ is the $i$th entry of the stochastic vector $\phi$ with the property that $\lim_{t \to \infty} A^t = \phi\mathbf{1}^T$. The main approach to tackle consensus in directed networks or for non-doubly stochastic weight matrices is the push-sum protocol, introduced for the first time in [KDG03]. In the push-sum algorithm each worker $i$ updates its auxiliary scalar variable $y_i(t)$ according to the following rule:

$$y_i(t+1) = \sum_{j \in N_i^{\mathrm{in}}} a_{ij} y_j(t).$$

Note that the matrix $A$ is column stochastic but not necessarily row-stochastic; thus one can see that if the scalars $y_i$ are initialized with $y_i(1) = 1$ for all $i \in [n]$, then

$$Y(t) = A^t Y(1) = A^t \mathbf{1}.$$

Thus, this expression implies that as $t \to \infty$ we have

$$Y(t) \xrightarrow{t \to \infty} n \cdot \phi,$$

which implies that for all $i \in [n]$:

$$\frac{x_i(t)}{y_i(t)} \xrightarrow{t \to \infty} \frac{\phi_i \mathbf{1}^T X(1)}{n \cdot \phi_i} = \bar{X}(1).$$

This shows the asymptotic convergence of $z_i(t) := x_i(t)/y_i(t)$ to the initial mean. Since the parameters $x_i(t)$ and $y_i(t)$ are kept locally at each node, in every iteration each node can obtain its variable $z_i(t)$ in the decentralized setting. Based on this approach, we present a communication-efficient algorithm for gossip over directed networks which uses quantization for reducing communication time (Section 4.1). Moreover, we will show exact convergence with the same rate as the push-sum algorithm without quantization.
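The effect of the push-sum correction can be checked numerically. The sketch below is illustrative only: it uses scalar node values, a hypothetical 4-node column-stochastic matrix `A` (built from out-degrees, with self-loops), and exact communication. It compares the ratio $x_i/y_i$ with the uncorrected iterate $x_i$.

```python
def push_sum_gossip(A, x_init, iterations):
    """Exact-communication push-sum on scalar values; returns (z, x)."""
    n = len(A)
    x = list(x_init)
    y = [1.0] * n  # auxiliary scalars start at one
    for _ in range(iterations):
        # Both x and y are mixed with the same column-stochastic weights.
        x = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        y = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]
    z = [x[i] / y[i] for i in range(n)]  # de-biased local estimates
    return z, x


# Hypothetical weights for the digraph 0->1, 1->2, 2->0, 2->3, 3->0 (columns sum to 1)
A = [[1 / 2, 0.0, 1 / 3, 1 / 2],
     [1 / 2, 1 / 2, 0.0, 0.0],
     [0.0, 1 / 2, 1 / 3, 0.0],
     [0.0, 0.0, 1 / 3, 1 / 2]]
x_init = [1.0, 2.0, 3.0, 4.0]

z, x = push_sum_gossip(A, x_init, iterations=200)
mean = sum(x_init) / len(x_init)
print("initial mean:", mean)
print("z (push-sum):", z)
print("x (uncorrected):", x)
```

In this example the uncorrected values settle at $\phi_i \mathbf{1}^T X(1)$, which differs from node to node, while the ratios all agree with the initial mean.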

### 3.1 Extension to decentralized optimization

As studied in [TLR12], the push-sum method for reaching consensus among nodes can be extended to decentralized convex optimization problems with some modifications. The push-sum algorithm for decentralized optimization with exact communication can be summarized in the following steps:

$$\begin{cases} x_i(t+1) = \sum_{j=1}^{n} a_{ij} x_j(t) - \alpha(t) \nabla f_i(z_i(t)) \\ y_i(t+1) = \sum_{j \in N_i^{\mathrm{in}}} a_{ij} y_j(t) \\ z_i(t+1) = \dfrac{x_i(t+1)}{y_i(t+1)} \end{cases}$$

Here, local gradients are computed at the scaled parameters $z_i(t)$, while the parameters $y_i(t)$ are obtained as in the gossip push-sum method. It is shown by [TLR12] that for all nodes $i$ and all $T \geq 1$ the iterates of the push-sum method satisfy

$$f(\tilde{z}_i(T)) - f(z^\star) \leq O\left(\frac{1}{\sqrt{T}}\right),$$

for $\alpha(t) = O(1/\sqrt{t})$ and $\tilde{z}_i(T)$ denoting the weighted time average of $z_i(t)$ for $t = 1, \dots, T$. This result indicates that for column stochastic matrices, the push-sum protocol achieves the optimal solution at a sublinear rate of $O(1/\sqrt{T})$. In the following section, we show that one can obtain a similar complexity bound even when nodes exchange quantized signals (Section 4.2).

## 4 Quantized Push-sum for Directed Graphs

In this section, we propose two variants of the push-sum method with quantization for both gossip (consensus) and decentralized optimization over directed graphs.

### 4.1 Quantized push-sum for Gossip

We present a quantized gossip algorithm for the consensus problem in which nodes communicate quantized parameters over a directed graph. For this purpose we use the push-sum protocol [TLR12] combined with the quantization scheme introduced in [KSJ19] in the context of decentralized gossip and optimization over symmetric doubly stochastic graphs. The steps of our proposed algorithm are described in Algorithm 1. Algorithm 1 consists of two parts. The first is the "Quantization" step, in which each node $i$ computes

$$Q_i(t) := Q(x_i(t) - \hat{x}_i(t)),$$

where $\hat{x}_i(t)$ is an auxiliary parameter stored locally at each node and updated at each iteration. Importantly, every node $i$ communicates $Q_i(t)$ to its out-neighbors. Quantizing and communicating $x_i(t) - \hat{x}_i(t)$ instead of $x_i(t)$ is a crucial part of the algorithm, as it guarantees that the quantization noise asymptotically vanishes. The second part of the proposed algorithm is the "Averaging" step, in which every node $i$ updates in parallel its parameters $x_i$, $y_i$ and $z_i$. The variables $y_i$ and $z_i$ are updated as in the push-sum algorithm. For updating $x_i$, the algorithm uses estimates of the values $x_j$ of its in-neighbors, denoted by $\hat{x}_j$. Each worker keeps track of the auxiliary parameters $\hat{x}_j$ of its in-neighbors by updating them with the $Q_j(t)$ received from them:

$$\hat{x}_j(t+1) = \hat{x}_j(t) + Q_j(t).$$

Using the same initialization for all copies of $\hat{x}_i$ kept locally in the out-neighbors of node $i$, one can see that $\hat{x}_i(t)$ remains the same among all out-neighbors of node $i$ for all iterations $t$. Similar to the push-sum protocol with exact communication, the role of $y_i(t)$ in Algorithm 1 is to scale the parameters $x_i(t)$ of all nodes in order to obtain $z_i(t)$, which is the estimate of the average of the initial values of the nodes.
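The Quantization and Averaging steps can be simulated centrally in matrix form. The following is a sketch under stated assumptions rather than the paper's Algorithm 1 verbatim: it uses the update $X(t+1) = X(t) + (A - I)\hat{X}(t+1)$, a low-precision stochastic quantizer with `s` levels, a zero initialization of the auxiliary variables, and a hypothetical 4-node weight matrix. The names `quantize` and `quantized_push_sum_gossip` are illustrative.

```python
import math
import random


def quantize(v, s, rng):
    """Unbiased stochastic quantizer with s levels per entry (scaled by ||v||)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return [0.0] * len(v)
    out = []
    for c in v:
        scaled = abs(c) / norm * s
        level = min(int(scaled), s - 1)
        up = rng.random() < scaled - level  # round up w.p. the fractional part
        out.append(norm * math.copysign(1.0, c) * (level + up) / s)
    return out


def quantized_push_sum_gossip(A, X_init, s, iterations, seed=0):
    rng = random.Random(seed)
    n, d = len(X_init), len(X_init[0])
    X = [row[:] for row in X_init]
    X_hat = [[0.0] * d for _ in range(n)]  # shared estimates, assumed to start at zero
    y = [1.0] * n
    for _ in range(iterations):
        # Quantization step: node i broadcasts Q(x_i - x_hat_i) only.
        Q = [quantize([X[i][k] - X_hat[i][k] for k in range(d)], s, rng)
             for i in range(n)]
        X_hat = [[X_hat[i][k] + Q[i][k] for k in range(d)] for i in range(n)]
        # Averaging step: X <- X + (A - I) X_hat, and y <- A y.
        X = [[X[i][k] - X_hat[i][k] + sum(A[i][j] * X_hat[j][k] for j in range(n))
              for k in range(d)] for i in range(n)]
        y = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]
    return [[X[i][k] / y[i] for k in range(d)] for i in range(n)]


A = [[1 / 2, 0.0, 1 / 3, 1 / 2],
     [1 / 2, 1 / 2, 0.0, 0.0],
     [0.0, 1 / 2, 1 / 3, 0.0],
     [0.0, 0.0, 1 / 3, 1 / 2]]
X_init = [[1.0, -2.0, 0.5], [4.0, 0.0, 1.5], [-3.0, 2.0, 2.5], [2.0, 6.0, -0.5]]
Z = quantized_push_sum_gossip(A, X_init, s=256, iterations=200)
target = [sum(row[k] for row in X_init) / 4 for k in range(3)]
print("initial mean:", target)
print("node 0 estimate:", Z[0])
```

Because only the difference $x_i - \hat{x}_i$ is quantized and that difference shrinks as the nodes agree, the injected noise shrinks with it. In this example every node's estimate approaches the exact initial mean despite each message being quantized.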

### 4.2 Quantized push-sum for decentralized optimization

Using the push-sum method for optimization problems, we propose Algorithm 2 for communication-efficient collaborative optimization over directed graphs. Similar to the method described in Algorithm 1, this method also has the Quantization and Averaging steps, with the difference that the update rule for $x_i(t+1)$ is similar to one iteration of stochastic gradient descent:

$$x_i(t+1) = x_i(t) - \hat{x}_i(t+1) + \sum_{j \in N_i^{\mathrm{in}}} a_{ij} \hat{x}_j(t+1) - \alpha \nabla F_i(z_i(t+1), \zeta_{i,t+1}).$$

Importantly, we note that stochastic gradients are evaluated at the scaled values $z_i(t+1)$. One can observe that, similar to Algorithm 1, here the role of $y_i$ is to correct the parameters $x_i$ through scaling by $1/y_i$.

In Section 5, we show that the locally kept parameters $z_i$ reach consensus at the rate $O(1/T)$ for $\alpha = O(1/\sqrt{T})$ (see (10)). Furthermore, with the same step size, the time average of the parameters $z_i(t)$ for $t = 1, \dots, T$ converges to the optimal solution for convex losses, and to a stationary point for non-convex losses, with the same rate as Decentralized Stochastic Gradient Descent (DSGD) with exact communication. This reveals that quantization and the structure of the graph (e.g., directed or undirected) have no effect on the rate of convergence under the proposed algorithm; these dependencies appear only in the constants of the upper bound.
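A small simulation of the optimization variant is sketched below. Several choices here are assumptions for illustration and are not confirmed by the text above: the averaging result `w` is formed first, the scaled point `z = w / y` is computed from it, and the gradient step is then applied to `w`; the local losses are the quadratics $\frac{1}{2}\|x - \zeta\|^2$ with one random sample per step; and the data, step size and graph are hypothetical.

```python
import math
import random


def quantize(v, s, rng):
    """Unbiased stochastic quantizer with s levels per entry (scaled by ||v||)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return [0.0] * len(v)
    out = []
    for c in v:
        scaled = abs(c) / norm * s
        level = min(int(scaled), s - 1)
        up = rng.random() < scaled - level
        out.append(norm * math.copysign(1.0, c) * (level + up) / s)
    return out


def quantized_push_sum_sgd(A, data, s, alpha, iterations, seed=0):
    rng = random.Random(seed)
    n, d = len(data), len(data[0][0])
    x = [[0.0] * d for _ in range(n)]
    x_hat = [[0.0] * d for _ in range(n)]
    y = [1.0] * n
    z = [[0.0] * d for _ in range(n)]
    z_avg = [[0.0] * d for _ in range(n)]
    for _ in range(iterations):
        q = [quantize([x[i][k] - x_hat[i][k] for k in range(d)], s, rng)
             for i in range(n)]
        x_hat = [[x_hat[i][k] + q[i][k] for k in range(d)] for i in range(n)]
        # Assumed ordering: average, then scale, then take the gradient step.
        w = [[x[i][k] - x_hat[i][k] + sum(A[i][j] * x_hat[j][k] for j in range(n))
              for k in range(d)] for i in range(n)]
        y = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]
        z = [[w[i][k] / y[i] for k in range(d)] for i in range(n)]
        for i in range(n):
            sample = rng.choice(data[i])  # one local data point per step
            # The gradient of 0.5 * ||z - sample||^2 with respect to z is z - sample.
            x[i] = [w[i][k] - alpha * (z[i][k] - sample[k]) for k in range(d)]
            z_avg[i] = [z_avg[i][k] + z[i][k] / iterations for k in range(d)]
    return z_avg, z


A = [[1 / 2, 0.0, 1 / 3, 1 / 2],
     [1 / 2, 1 / 2, 0.0, 0.0],
     [0.0, 1 / 2, 1 / 3, 0.0],
     [0.0, 0.0, 1 / 3, 1 / 2]]
data_rng = random.Random(1)
centers = [[0.0, 0.0], [1.0, -1.0], [2.0, -2.0], [3.0, -3.0]]
data = [[[c[k] + data_rng.uniform(-0.5, 0.5) for k in range(2)] for _ in range(5)]
        for c in centers]
optimum = [sum(p[k] for node in data for p in node) / 20 for k in range(2)]

z_avg, z_last = quantized_push_sum_sgd(A, data, s=256, alpha=0.002, iterations=6000)
print("optimum:       ", optimum)
print("node 0, last z:", z_last[0])
print("node 0, avg z: ", z_avg[0])
```

With a constant step size the last iterate is expected to hover near the minimizer of the average loss, up to a small bias and sampling noise. The time average additionally carries the transient from the zero initialization, so it is looser for a finite horizon.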

## 5 Convergence Analysis

In this section, we study convergence properties of our proposed schemes for quantized gossip and decentralized stochastic learning with convex and non-convex objectives. To do so, we first assume the following conditions on the quantization scheme and loss functions are satisfied.

###### Assumption 3 (Quantization Scheme).

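The quantization function $Q : \mathbb{R}^d \to \mathbb{R}^d$ satisfies for all $x \in \mathbb{R}^d$:

$$\mathbb{E}\left[\left\| Q(x) - x \right\|^2\right] \leq \omega^2 \|x\|^2, \tag{6}$$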

where .

In the following, we mention a specific quantization scheme and formally state its parameter .

###### Example 2 (Low-precision Quantizer).

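[AGL+17] The unbiased stochastic quantization assigns $\xi_i(x, s)$ to each entry $x_i$ in $x$, where $s$ is the number of levels used for encoding and

$$\xi_i(x, s) = \begin{cases} (\ell + 1)/s & \text{w.p. } \dfrac{|x_i|}{\|x\|_2}\, s - \ell, \\ \ell/s & \text{otherwise.} \end{cases}$$

Here $\ell$ is the integer satisfying $0 \leq \ell < s$ and $|x_i| / \|x\|_2 \in [\ell/s, (\ell+1)/s]$. The node at the receiving end recovers the message according to:

$$Q(x_i) = \|x\|_2 \cdot \mathrm{sign}(x_i) \cdot \xi_i(x, s).$$

This quantization scheme satisfies Assumption 3 with

$$\omega^2 = \min\left(d/s^2, \sqrt{d}/s\right).$$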
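The quantizer of Example 2 is short enough to write out. The sketch below is one straightforward reading of the definition (the function name `quantize` and the use of Python's `random.Random` as the randomness source are choices made here). It returns the decoded vector $Q(x)$ rather than an encoded bit stream.

```python
import math
import random


def quantize(x, s, rng):
    """Low-precision stochastic quantizer: each entry is rounded to one of the
    levels 0, 1/s, ..., 1 (relative to ||x||_2), randomly and without bias."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return [0.0] * len(x)
    out = []
    for v in x:
        scaled = abs(v) / norm * s          # position on the level grid, in [0, s]
        level = min(int(scaled), s - 1)     # integer l with 0 <= l < s
        p_up = scaled - level               # probability of rounding up to l + 1
        xi = (level + 1) / s if rng.random() < p_up else level / s
        out.append(norm * math.copysign(1.0, v) * xi)
    return out


rng = random.Random(0)
x = [3.0, -4.0, 0.0]
print("one draw:", quantize(x, 4, rng))

# Empirical check of unbiasedness: the sample mean should be close to x.
trials = 20000
mean = [0.0] * len(x)
for _ in range(trials):
    for k, v in enumerate(quantize(x, 4, rng)):
        mean[k] += v / trials
print("sample mean:", mean)
```

Only the norm, the signs and the integer levels need to be transmitted, which is where the communication savings come from. Rounding up with probability equal to the fractional part is what makes the quantizer unbiased.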

For convenience we assume that the parameters and of all nodes are initialized with zero vectors. This assumption is without loss of generality and is taken to simplify the resulting convergence bounds.

###### Assumption 4 (Initialization).

The parameters and are initialized with and for all .

Additionally we make the following assumptions on the local objective function of each computing node and its stochastic gradient.

###### Assumption 5 (Lipschitz Local Gradients).

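Local functions $f_i$ have $L$-Lipschitz gradients, i.e., for all $i \in [n]$

$$\left\| \nabla f_i(x) - \nabla f_i(y) \right\| \leq L \|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

###### Assumption 6 (Bounded Stochastic Gradients).

Local stochastic gradients have bounded second moment, i.e., for all $i \in [n]$

$$\mathbb{E}_{\zeta_i \sim D_i} \left\| \nabla F_i(x, \zeta_i) \right\|^2 \leq D^2, \quad \forall x \in \mathbb{R}^d.$$

###### Assumption 7 (Bounded Variance).

Local stochastic gradients have bounded variance, i.e., for all $i \in [n]$

$$\mathbb{E}_{\zeta_i \sim D_i} \left\| \nabla F_i(x, \zeta_i) - \nabla f_i(x) \right\|^2 \leq \sigma^2, \quad \forall x \in \mathbb{R}^d.$$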

First, we show that in Algorithm 1, the parameters of all nodes recover the exact value of initial mean in linear iterations. For convenience we denote by and .

###### Theorem 5.1 (Gossip).

Recall the definitions of and in Proposition 2.1. Under Assumptions 1-3, the iterations of Algorithm 1 satisfy for , and all

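$$\mathbb{E}\left\| z_i(t+1) - \frac{1}{n}\sum_{i=1}^{n} x_i(1) \right\| \leq \frac{4C^2}{\delta} \cdot \frac{\omega\gamma \|X(1)\|}{\lambda - \lambda^{3/2}}\, \lambda^{t/2} + \frac{2C\|X(1)\|}{\delta}\, \lambda^{t}. \tag{7}$$

This bound shows the effect of the parameters related to the structure of the graph and the weight matrix, i.e., $C$, $\lambda$ and $\delta$, and of the quantization parameter $\omega$. In particular, $\omega$ emerges as the coefficient of the larger term, and choosing $\omega = 0$, which corresponds to gossip with exact communication, results in convergence with rate $O(\lambda^t)$.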

Next, we show that the quantization method for decentralized stochastic learning over directed graphs as described in Algorithm 2 converges to the optimal solution with optimal rates for convex objectives. In particular we show that global objective function evaluated at time average of converges to the optimal solution after iterations with the rate . The next theorem characterizes the convergence bound for Algorithm 2 with convex objectives. For compactness we define constants and

###### Theorem 5.2 (Convex Objectives).

Assume local functions for all are convex, then under Assumptions 1-7 Algorithm 2 for , and satisfies for all

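$$\mathbb{E} f\left(\frac{1}{T}\sum_{t=1}^{T} z_i(t+1)\right) - f(z^\star) \leq \frac{8L(L+1)}{\sqrt{nT}} \|z^\star\|^2 + \frac{\sigma^2 (L+1)}{4L\sqrt{nT}} + \frac{nC^2 (L+1)\left(\frac{L\sqrt{n}}{2\sqrt{T}} + L + 1\right)\left(\xi\omega^2 + 2nD^2\right)}{10\,\delta^2 (1-\lambda)^2 L^2 T}. \tag{8}$$

###### Remark 1.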

In the proof of the theorem we show that (see (9)) for fixed arbitrary the error decays with the rate . The inequality in the statement of the theorem follows by the specified choice of . More importantly, we highlight that the largest term in (5.2), i.e., , is directly proportionate to and which emphasizes the impact of the number of workers and mini-batch size in accelerating the speed of convergence. Additionally parameters related to the structure of graph, i.e., and and the quantization parameter , only appear in the terms of order and which are asymptotically negligible compared to .

In the next theorem we show convergence of Algorithm 2 with non-convex objectives. Importantly, we demonstrate that the gradient of the global function converges to the zero vector with the same optimal rate as in decentralized optimization with exact communication over undirected graphs (see [LZZ+17]).

###### Theorem 5.3 (Non-convex Objectives).

Under Assumptions 1-7, Algorithm 2 after sufficiently large iterations , and for and satisfies:

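$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left\| \nabla f\left(\frac{1}{n}\sum_{i=1}^{n} x_i(t)\right) \right\|^2 + \frac{1}{2T}\sum_{t=1}^{T} \mathbb{E}\left\| \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(z_i(t+1)) \right\|^2 \leq \frac{\sigma^2 L}{\sqrt{nT}} + \frac{2L\left(f(0) - f^\star\right)}{\sqrt{nT}} + \frac{12C^2}{\delta^2 (1-\lambda)^2 T}\left(\xi\omega^2 + 2nD^2\right). \tag{9}$$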

Moreover for all and , it holds that

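$$\mathbb{E}\left\| z_i(T+1) - \frac{1}{n}\sum_{i=1}^{n} x_i(T) \right\|^2 \leq \frac{6C^2 n}{\delta^2 (1-\lambda)^2 L^2 T}\left(\xi\omega^2 + 2nD^2\right). \tag{10}$$

###### Remark 2.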

The inequality (5.3) implies convergence of the average of among workers to a stationary point as well as convergence of average of local gradients to zero with the optimal rate for non-convex objectives. Interestingly similar to the convex-case discussed in Remark 1, the number of workers and stochastic gradient variance emerge in the dominant terms while the parameters related to weight matrix, graph structure and quantization appear in the term of order .

###### Remark 3.

As the proof of theorem shows, for arbitrary fixed step size we derive the inequality in (10) and the desired result of theorem is concluded by the specified choice of in the theorem. Importantly we note that the condition on the number of iterations ,i.e. , is a direct result of the specified choice of . Therefore one can get the same rate for convergence for all , with other choices for the step size. For example using the relation (10), by choosing we achieve convergence for all

###### Remark 4.

Based on (10), the parameters of nodes reach consensus with the rate for the specified value of . For an arbitrary value of , consensus is achieved with the rate (see (28)) which implies that smaller values of will result in faster consensus among the nodes. However due to the term in the convergence of objective to the optimal solution or in the convergence to a stationary point, this fast consensus will be at the cost of slower convergence of objective function in both convex and non-convex cases.

## 6 Numerical Experiments

In this section, we compare the proposed methods for communication-efficient message passing over directed graphs with the push-sum protocol using exact communication (e.g., as formulated in [KDG03, TLR12] for gossip or optimization problems). Throughout the experiments we use the strongly-connected directed graphs and with vertices as illustrated in Figure 1. For each graph we form the column-stochastic weight matrix according to the method described in Example 1. In order to study the effect of graph structure we consider to be denser, with more connections between nodes. For quantization, we use the method discussed in Example 2 with bits used for each entry of the vector (one bit is allocated for the sign). Moreover, the norm of the transmitted vector and the scalars are transmitted without quantization. In the push-sum protocol with exact communication the entries of the vector and the scalar are each transmitted with 54 bits.

Gossip experiments. First, we evaluate the performance of Algorithm 1 for gossip type problems. We initialize the parameters of all nodes to be uniformly distributed random variables. Moreover we initialize the auxiliary parameters and for all . In Figure 4 (Left) we compare the performance of Algorithm 1 with the push-sum protocol with exact communication for both networks and . While Algorithm 1 has almost the same performance as push-sum over , it is outperformed by push-sum over .

However, the per-iteration advantage of exact-communication methods is to be expected, since they transmit full-precision messages. In order to compare the two methods based on the time spent to reach a specific level of error, we compare their performance based on the number of bits that each worker communicates. In Figure 4 (Right) the number of bits required to reach a certain error level is illustrated for both methods. For the graphs and we observe up to 10x and 6x reduction in the total number of bits, respectively.

Decentralized optimization experiments. Next, we study the performance of Algorithm 2 for decentralized stochastic optimization using convex and non-convex objective functions. First, we consider the objective

$$f(x) = \frac{1}{2nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left\| x - \zeta^i_j \right\|^2,$$

where we set and . Thus each node has access to its local data-set and is using one sample at random in each iteration to obtain the stochastic gradient of its own local function. Here we use the data-set generated according to

where is a fixed vector initialized as . The step size for each setting, is fine-tuned up to iteration among values in the interval . The Loss at iteration is presented by , where is the time-average of the model of worker and is the optimal solution.

The results of this experiment are in Figure 7 (Left) which illustrates the convergence of Algorithm 2 based on the number of iterations for different levels of quantization and over the two graphs and . The non-quantized method outperforms the quantized methods based on iteration. This is due to the quantization noise injected in the flow of information over the graph which depends on the number of bits each node uses for encoding and the structure of graph. However this error asymptotically vanishes resulting in small overall quantization noise. This implies that with less quantization noise (i.e., using more bits to encode) the loss decay based on iteration number gets smaller. However as we observe in Figure 7 (Right), more quantization levels will result in larger number of bits required to achieve a certain level of loss. Consequently, the push-sum protocol with exact communication for optimization over directed graphs is not communication-efficient as we demonstrated that using smaller number of bits with Algorithm 2 results in x reduction in transmitted bits.

As we showed in Theorem 5.3 in Section 5, our proposed method guarantees convergence to a stationary point for non-convex and smooth objective functions. In order to illustrate this, we train a neural network with hidden units with sigmoid activation function to classify the MNIST data-set into classes. We use the graph with nodes where each node has access to samples of the data-set and uses a randomly selected mini-batch of size for computing the local stochastic gradient. For each setting, the step-size is fine-tuned up to iteration and over values in the interval . Figure 10 illustrates the results for the training loss of the two aforementioned methods based on the number of iterations (Left) and the total number of bits communicated between two neighbor nodes (Right). We note the close gap in each iteration between the loss decay of our proposed method with bits quantization and that of push-sum with exact communication. However, since our method uses significantly fewer bits in each iteration, it reaches the same training loss with far fewer communicated bits. In particular Figure 10 (Right) demonstrates x reduction in the total number of bits communicated using our proposed method.

## 7 Conclusion and Future Work

In this paper, we proposed a scheme for communication-efficient decentralized learning over directed graphs. We showed that our method converges at the same convergence rate as non-quantized methods for both gossip and decentralized optimization problems. As we demonstrated in Section 6, the proposed approach results in significant improvements in communication-time of the push-sum protocol. An interesting future direction is extending these results to algorithms that achieve linear convergence for strongly-convex problems (e.g., [XK17]). Another direction is extending our results to asynchronous decentralized optimization over directed graphs where unlike our setting the parameters on nodes are received and updated with delay.

## Appendix

In this section, we first provide further numerical experiments as well as presenting the hyper-parameters of all methods used in the experiments. We then continue with presenting proofs of the Theorems 5.1-5.3.

### 7.1 Neural Network Training on CIFAR-10 dataset

We train a neural network of 20 hidden-units with sigmoid activation functions for binary classification of the CIFAR-10 dataset. We use the topology of as in Figure 1 with the corresponding weight matrix constructed according to Example 1. The value of step size is fine-tuned among 15 values of uniformly selected in and up to iteration 300 for both algorithms.

Figure 11 illustrates the performance of the proposed algorithm based on iteration number (Left) and number of bits communicated between two neighbor nodes (Right), and compares the vanilla push-sum method and the proposed communication-efficient method with 8 bits for quantization. We use SGD with mini-batch size of 100 samples for each node in both methods. We highlight the close similarity in training loss of the two methods for a fixed number of iteration. More importantly the proposed method uses remarkably smaller number of bits implying faster communication while reaching the same level of training loss.

### 7.2 Details of the Numerical Experiments

The step-sizes of the algorithms used in Section 6 are fine-tuned over the interval , so that the best error achieved by each method is compared. In Table 1 we present the fine-tuned step sizes as well as the model size (i.e., dimension of the parameters) and the mini-batch size of the algorithms that are used throughout the numerical experiments.

## Proofs

### Notation

Throughout this section we set the following notation. For variables, stochastic gradients and gradients, we concatenate the row vectors corresponding to each node to form the following matrices:

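$$\begin{aligned} Z(t) &:= [z_1(t); z_2(t); \cdots; z_n(t)] \in \mathbb{R}^{n \times d}, \\ \partial F(Z(t), \zeta_t) &:= [\nabla F_1(z_1(t), \zeta_{1,t}); \nabla F_2(z_2(t), \zeta_{2,t}); \cdots; \nabla F_n(z_n(t), \zeta_{n,t})] \in \mathbb{R}^{n \times d}, \\ \partial f(Z(t)) &:= [\nabla f_1(z_1(t)); \nabla f_2(z_2(t)); \cdots; \nabla f_n(z_n(t))] \in \mathbb{R}^{n \times d}. \end{aligned}$$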

## 8 Proof of Theorem 5.1 : Quantized Gossip over Directed Graphs

First we write the iterations of Algorithm 1 in matrix notation to derive the following:

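$$\begin{cases} Q(t) = Q\left(X(t) - \hat{X}(t)\right) \\ \hat{X}(t+1) = \hat{X}(t) + Q(t) \\ X(t+1) = X(t) + (A - I)\hat{X}(t+1) \\ y(t+1) = A y(t) \\ z_i(t+1) = \dfrac{x_i(t+1)}{y_i(t+1)} \end{cases} \tag{11}$$

Based on this, we can rewrite the update rule for $X(t+1)$ as follows:

$$X(t+1) = AX(t) + (A - I)\left(\hat{X}(t+1) - X(t)\right).$$

By repeating this for $X(t), X(t-1), \dots, X(2)$, the update rule for $X(t+1)$ takes the following shape:

$$X(t+1) = A^t X(1) + \sum_{s=0}^{t-1} A^s (A - I)\left(\hat{X}(t-s+1) - X(t-s)\right). \tag{12}$$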

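Multiplying both sides by $\mathbf{1}^T$ and recalling that $\mathbf{1}^T A = \mathbf{1}^T$ yields that

$$\mathbf{1}^T X(t+1) = \mathbf{1}^T X(1). \tag{13}$$

With this and (12) we have for all $t \geq 1$:

$$\left\| X(t+1) - \phi\mathbf{1}^T X(1) \right\| = \left\| A^t X(1) + \sum_{s=0}^{t-1} A^s (A - I)\left(\hat{X}(t-s+1) - X(t-s)\right) - \phi\mathbf{1}^T X(1) \right\| \leq C\lambda^t \left\| X(1) \right\| + 2C \sum_{s=0}^{t-1} \lambda^s \left\| \hat{X}(t-s+1) - X(t-s) \right\|. \tag{14}$$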
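The constants in (14) follow from the bound (4) of Proposition 2.1. One way to see this, written out here as a short supporting calculation rather than quoted from the original proof, is to add and subtract $\phi\mathbf{1}^T$ inside each summand:

```latex
\begin{aligned}
\left\| A^t X(1) - \phi\mathbf{1}^T X(1) \right\|
  &\leq \left\| A^t - \phi\mathbf{1}^T \right\| \left\| X(1) \right\|
   \leq C\lambda^t \left\| X(1) \right\|, \\
\left\| A^s (A - I) \right\|
  &= \left\| \left(A^{s+1} - \phi\mathbf{1}^T\right) - \left(A^{s} - \phi\mathbf{1}^T\right) \right\| \\
  &\leq C\lambda^{s+1} + C\lambda^{s}
   \leq 2C\lambda^{s}, \qquad \text{since } 0 < \lambda < 1 .
\end{aligned}
```

Applying the triangle inequality to the sum in (12) with these two estimates gives the right-hand side of (14).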

Furthermore, by the iterations of Algorithm 1 in (11) as well as the assumption on quantization noise in Assumption 3, we find that

Next, we add and subtract to the RHS and also use the fact that ( [ZY15]) to conclude that

 (15)

Let and , then we conclude that

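$$\mathbb{E}\left\| X(t+1) - \hat{X}(t+2) \right\| \leq \lambda_1 \left\| \hat{X}(t+1) - X(t) \right\| + \lambda_1' \left\| X(t) - \phi\mathbf{1}^T X(1) \right\|.$$

Denoting $\left\| X(t) - \phi\mathbf{1}^T X(1) \right\|$ by $R(t)$ and $\left\| \hat{X}(t+1) - X(t) \right\|$ by $U(t)$, we derive the next two inequalities based on (14) and (15):

$$\begin{cases} \mathbb{E} R(t+1) \leq C \cdot \lambda^t \|X(1)\| + 2C \sum_{s=0}^{t-1} \lambda^s U(t-s) \\ \mathbb{E} U(t+1) \leq \lambda_1 U(t) + \lambda_1' R(t) \end{cases} \tag{16}$$

###### Lemma 8.1.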

The iterates of $U(t)$ satisfy for all iterations $t \geq 1$

$$U(t) \leq \xi_1 \lambda^{t/2},$$

where for the values of chosen such that  .

###### Proof.

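Noting that $\lambda_1' \leq \lambda_1$, we have by (16) for all $t \geq 1$:

$$U(t+1) \leq \lambda_1 U(t) + 2C \cdot \lambda_1 \left( \lambda^{t-1} \|X(1)\| + \sum_{s=0}^{t-2} \lambda^s U(t-s-1) \right). \tag{17}$$

The proof is based on induction on the inequality in (17). Let $U(s) \leq \xi_1 \lambda^{s/2}$ for all $s \leq t$, then by (17)

$$\begin{aligned} U(t+1) &\leq \lambda_1 \cdot \xi_1 \cdot \lambda^{t/2} + 2C\lambda_1 \left( \lambda^{t-1} \|X(1)\| \right) + 2C\lambda_1 \cdot \xi_1 \cdot \frac{\lambda^{\frac{t-1}{2}}}{1 - \lambda^{1/2}} \\ &= \frac{\lambda_1 \cdot \xi_1}{\lambda^{1/2}} \lambda^{\frac{t+1}{2}} + 2C\lambda_1 \left( \|X(1)\| \cdot \lambda^{\frac{t-3}{2}} \right) \lambda^{\frac{t+1}{2}} + \frac{2C \cdot \lambda_1 \cdot \xi_1}{\lambda - \lambda^{3/2}} \lambda^{\frac{t+1}{2}} \\ &\leq \lambda_1 \xi_1 \left( \frac{1}{\lambda^{1/2}} + \frac{2C}{\lambda - \lambda^{3/2}} \right) \lambda^{\frac{t+1}{2}} + \frac{2C \cdot \lambda_1 \|X(1)\|}{\lambda} \cdot \lambda^{\frac{t+1}{2}} \\ &\leq \left( \frac{1}{2}\xi_1 + \frac{2C \cdot \lambda_1 \|X(1)\|}{\lambda} \right) \cdot \lambda^{\frac{t+1}{2}} = \xi_1 \lambda^{\frac{t+1}{2}}, \end{aligned}$$

where the last two steps follow by the assumptions of the lemma on $\omega$ and $\xi_1$. Moreover, for $t = 1$ we follow the inequalities similar to (15) to find that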

 U(1)≤ω∥X(1