On Consensus-Optimality Tradeoffs in Collaborative Deep Learning
Abstract
In distributed machine learning, where agents collaboratively learn from diverse private data sets, there is a fundamental tension between consensus and optimality. In this paper, we build on recent algorithmic progress in distributed deep learning to explore various consensus-optimality tradeoffs over a fixed communication topology. First, we propose the incremental consensus-based distributed SGD (iCDSGD) algorithm, which involves multiple consensus steps (where each agent communicates information with its neighbors) within each SGD iteration. Second, we propose the generalized consensus-based distributed SGD (gCDSGD) algorithm, which enables us to navigate the full spectrum from complete consensus (all agents agree) to complete disagreement (each agent converges to individual model parameters). We analytically establish convergence of the proposed algorithms for strongly convex and nonconvex objective functions; we also analyze the momentum variants of the algorithms for the strongly convex case. We support our algorithms with numerical experiments, and demonstrate significant improvements over existing methods for collaborative deep learning.
1 Introduction
Scaling up deep learning algorithms in a distributed setting [1, 2, 3] is becoming increasingly critical, impacting several applications such as learning in robotic networks [4], the Internet of Things (IoT) [5, 6], and mobile device networks [7]. Several distributed deep learning approaches have been proposed to address issues such as model parallelism [8], data parallelism [8, 9], and the role of communication and computation [10, 11].
We focus on the constrained communication topology setting where the data is distributed (so that each agent has its own estimate of the deep model) and where information exchange among the learning agents is constrained along the edges of a given communication graph [9, 12]. In this context, two key aspects arise: consensus and optimality. We refer the reader to Figure 1 for an illustration involving 3 agents. With sufficient information exchange, the learned model parameters corresponding to each agent could converge to a common estimate, in which case the agents achieve consensus but not optimality (here, optimality refers to the model estimate that would be obtained if all the data were centralized). On the other hand, if no communication happens, the agents may approach their individual model estimates while remaining far from consensus. The question is whether this tradeoff between consensus and optimality can be balanced so that all agents collectively agree upon a model estimate close to the centralized optimum.
Our contributions: In this paper, we propose, analyze, and empirically evaluate two new algorithmic frameworks for distributed deep learning that enable us to explore fundamental tradeoffs between consensus and optimality. The first approach is called incremental consensus-based distributed SGD (iCDSGD), a stochastic extension of the descent-style algorithm proposed in [13]; it runs multiple consensus steps, in which each agent exchanges information with its neighbors, within each SGD iteration. The second approach is called generalized consensus-based distributed SGD (gCDSGD) and is based on the concept of generalized gossip [14]; it involves a tuning parameter that explicitly controls the tradeoff between consensus and optimality. Specifically, we:

(Algorithmic) propose the iCDSGD and gCDSGD algorithms (along with their momentum variants).

(Theoretical) prove the convergence of gCDSGD (Theorems 1 & 3) and iCDSGD (Theorems 2 & 4) for strongly convex and nonconvex objective functions;

(Theoretical) prove the convergence of the momentum variants of gCDSGD (Theorem 5) and iCDSGD (Theorem 6) for strongly convex objective functions;

(Practical) empirically demonstrate that iCDMSGD (the momentum variant of iCDSGD) can achieve similar (global) accuracy as the state-of-the-art with lower fluctuation across epochs as well as better consensus;

(Practical) empirically demonstrate that gCDMSGD (the momentum variant of gCDSGD) can achieve similar (global) accuracy as the state-of-the-art with lower fluctuation, smaller generalization error, and better consensus.
We use both balanced and unbalanced data sets (i.e., equal or unequal distributions of training samples among the agents) for the numerical experiments with benchmark deep learning data sets. Please see Table 1 for a detailed comparison with existing algorithms.
Method | Setting | Con.Bou. | Opt.Bou. | Con.Rate | Mom.Ana. | C.C.T. | Sto.
FedAvg [15] | Nonconvex | N/A | N/A | N/A | No | No | Yes
[13] | Strcon | – | – | – | No | Yes | No
MSDA [16] | Strcon | N/A | N/A | – | Yes | Yes | No
CDSGD [9] | Strcon | – | – | – | No | Yes | Yes
CDSGD [9] | Nonconvex | – | – | N/A | No | Yes | Yes
AccDNGDSC [17] | Strcon | N/A | – | – | Yes | Yes | No
iCDSGD [This paper] | Strcon | – | – | – | Yes | Yes | Yes
iCDSGD [This paper] | Nonconvex | – | – | N/A | Yes | Yes | Yes
gCDSGD [This paper] | Strcon | – | – | – | Yes | Yes | Yes
gCDSGD [This paper] | Nonconvex | – | – | N/A | Yes | Yes | Yes

Con.Bou.: consensus bound. Opt.Bou.: optimality bound. Con.Rate: convergence rate. Mom.Ana.: momentum analysis. C.C.T.: constrained communication topology. Sto.: stochastic. Strcon: strongly convex. Entries marked "–" are explicit bounds or rates expressed in terms of the step size, the second largest eigenvalue of the stochastic interaction matrix, condition numbers, the strong convexity and smoothness constants, and other positive constants; these constants play analogous roles across methods but are not exactly the same.
Related work: A large literature has emerged that studies distributed deep learning in both centralized and decentralized settings [8, 15, 18, 19, 3, 20, 21, 22], and due to space limitations we only summarize the most recent work. [23] propose a gradient sparsification approach for communicationefficient distributed learning, while [24] propose the concept of ternary gradients to reduce communication costs. [16] propose a multistep dual accelerated method using a gossip protocol to provide an optimal decentralized optimization algorithm for smooth and strongly convex loss functions. Decentralized parallel stochastic gradient descent [12] has also been proposed.
Perhaps most closely related to this paper is the work of [13], who present a distributed optimization method that enables consensus when the cost of communication is cheap. However, the authors only consider convex optimization problems and only study deterministic gradient updates. Also, [17] propose a class of (deterministic) accelerated distributed Nesterov gradient descent methods that achieve a linear convergence rate for the special case of strongly convex objective functions. In [25], both deterministic and stochastic distributed variants were discussed, although without any acceleration techniques. To our knowledge, none of these previous works have explicitly studied the tradeoff between consensus and optimality.
Outline: Section 2 presents the problem and several mathematical preliminaries. In Section 3, we present our two algorithmic frameworks, along with their analysis in Section 4. For validating the proposed schemes, several experimental results based on benchmark data sets are presented in Section 5. Concluding remarks are in Section 6.
2 Problem Formulation
We consider the standard unconstrained empirical risk minimization (ERM) problem that commonly arises in machine learning (for instance, in deep learning):
(1) 
where denotes the parameter vector of interest, denotes a given loss function, and is the function value corresponding to a data point . Our focus is to investigate the case where the ERM problem is solved collaboratively among a number of computational agents. In this paper, we are interested in problems where the agents exhibit data parallelism, i.e., they only have access to their own respective training datasets. However, we assume that the agents can communicate over a static undirected graph , where is a vertex set (with nodes corresponding to agents) and is an edge set. Throughout this paper we assume that the graph is connected.
Let denote the subset of the training data (comprising samples) corresponding to the agent such that , where is the total number of agents. With this formulation, and since , we have the following (constrained) reformulation of (1):
(2) 
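As a quick sanity check of this weighted reformulation, the sketch below verifies numerically that the centralized empirical risk equals the data-size-weighted average of the per-agent risks. The squared loss and all names here are illustrative assumptions, not the paper's notation:

```python
import numpy as np

# Sanity check of the data-parallel decomposition: the centralized empirical
# risk equals the data-size-weighted average of per-agent empirical risks.
rng = np.random.default_rng(2)
data = rng.standard_normal(1000)
splits = np.split(data, [300, 750])      # 3 agents with unbalanced shares
x = 0.7                                  # a candidate (scalar) parameter

def risk(samples):
    # empirical risk with an illustrative squared loss
    return np.mean(0.5 * (x - samples) ** 2)

centralized = risk(data)
weighted = sum(len(s) / len(data) * risk(s) for s in splits)
assert np.isclose(centralized, weighted)
```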
Equivalently, the concatenated form of the above equation is as follows:
(3) 
where , is the agent interaction matrix with its entries indicating the link between agents and , is the identity matrix of dimension , and represents the Kronecker product.
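The Kronecker-product construction can be sanity-checked numerically: multiplying the stacked parameter vector by the Kronecker product of the interaction matrix and an identity matrix is the same as each agent averaging over its neighborhood. The 3-agent matrix and dimensions below are illustrative assumptions:

```python
import numpy as np

# Hypothetical 3-agent topology; Pi is a doubly stochastic interaction
# matrix and d is the per-agent parameter dimension.
Pi = np.array([[0.50, 0.25, 0.25],
               [0.25, 0.50, 0.25],
               [0.25, 0.25, 0.50]])
n, d = Pi.shape[0], 4

rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))     # row j = agent j's parameters
x_concat = x.reshape(-1)            # concatenated parameter vector
big_W = np.kron(Pi, np.eye(d))      # (n*d) x (n*d) mixing matrix

# One consensus step in concatenated form ...
mixed_concat = big_W @ x_concat
# ... equals per-agent neighborhood averaging.
mixed_per_agent = (Pi @ x).reshape(-1)
assert np.allclose(mixed_concat, mixed_per_agent)
```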
We now introduce several key definitions and assumptions that characterize the above problem.
Definition 1.
A function is said to be strongly convex, if for all , we have ; it is said to be smooth if we have ; it is said to be Lipschitz continuous if we have . Here, represents the Euclidean norm.
Definition 2.
A function is said to be coercive if it satisfies:
Assumption 1.
The objective functions are assumed to satisfy the following conditions: a) each is smooth; b) each is proper (not everywhere infinite) and coercive; c) each is Lipschitz continuous.
Assumption 2.
The interaction matrix is normalized to be doubly stochastic; the second largest eigenvalue of is strictly less than 1, i.e., , where is the second largest eigenvalue of . If , then .
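Assumption 2 is straightforward to verify numerically for a given interaction matrix. The sketch below, with an illustrative 3-agent matrix, checks double stochasticity and the spectral condition:

```python
import numpy as np

# Illustrative interaction matrix for 3 agents (an assumption, not the
# paper's matrix): connected topology with uniform off-diagonal weights.
Pi = np.array([[0.6, 0.2, 0.2],
               [0.2, 0.6, 0.2],
               [0.2, 0.2, 0.6]])

assert np.allclose(Pi.sum(axis=0), 1.0)   # columns sum to one
assert np.allclose(Pi.sum(axis=1), 1.0)   # rows sum to one (doubly stochastic)

eigvals = np.sort(np.abs(np.linalg.eigvals(Pi)))[::-1]
assert np.isclose(eigvals[0], 1.0)        # largest eigenvalue equals 1
assert eigvals[1] < 1.0                   # second largest strictly below 1
```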
For convenience, we use shorthand to represent , and similarly for , which signifies the largest eigenvalue of . An immediate consequence of Assumption 1(c) is that is Lipschitz continuous, where .
We will solve (2) in a distributed and stochastic manner. For solving stochastic optimization problems, variants of the wellknown stochastic gradient descent (SGD) have been commonly employed. For the formulation in (2), the stateoftheart algorithm is a method called consensus distributed SGD, or CDSGD, recently proposed in [9]. This method estimates according to the update equation:
(4) 
where indicates the neighborhood of agent , is the step size, is the (stochastic) gradient of at , implemented by drawing a minibatch of sampled data points. More precisely, where is the size of the minibatch selected uniformly at random from the data subset available to Agent .
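To make the CDSGD update concrete, the following minimal sketch runs the pattern of Eq. 4 (consensus mixing plus a gradient step) on a toy strongly convex problem. The topology, objective, and exact-gradient simplification are all illustrative assumptions:

```python
import numpy as np

# Toy setup: agent i minimizes f_i(x) = 0.5 * (x - c_i)^2, so the
# centralized optimum is the mean of the c_i (= 2.0 here).
Pi = np.full((3, 3), 1.0 / 3)        # fully connected, uniform weights
c = np.array([[1.0], [2.0], [3.0]])  # per-agent data "centers"
x = np.zeros((3, 1))                 # per-agent parameter estimates
alpha = 0.1                          # step size

for _ in range(300):
    grad = x - c                     # per-agent gradients (exact, not minibatch)
    x = Pi @ x - alpha * grad        # consensus mixing + gradient step

# Agents settle near the centralized optimum, within a small radius.
assert abs(x.mean() - 2.0) < 1e-3 and np.ptp(x) < 0.3
```

The iterates cluster around the centralized optimum but with a small residual disagreement, matching the intuition behind the consensus and optimality bounds analyzed later.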
3 Proposed Algorithms
State-of-the-art algorithms such as CDSGD alternate between gradient update and consensus steps. We propose two natural extensions in which one can control the emphasis on consensus relative to the gradient update, leading to interesting tradeoffs between consensus and optimality.
3.1 Increasing consensus
Observe that the concatenated form of the CDSGD updates, (4), can be expressed as
If we perform consensus steps interlaced with each gradient update, we can obtain the following concatenated form of the iterations of the parameter estimates:
(5) 
We call this variant incremental consensus-based distributed SGD (iCDSGD), detailed in Algorithm 1. Note that, in a distributed setting, this algorithm incurs an additional factor in communication complexity.
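The effect of interlacing several consensus steps with each gradient update can be illustrated on a toy quadratic. The topology, objective, and parameter names below are assumptions for illustration only:

```python
import numpy as np

# Sketch of the iCDSGD idea: perform tau consensus (averaging) steps before
# each gradient update, i.e., mix with the tau-th power of the matrix.
Pi = np.array([[0.50, 0.25, 0.25],
               [0.25, 0.50, 0.25],
               [0.25, 0.25, 0.50]])
c = np.array([[0.0], [3.0], [6.0]])  # per-agent targets, f_i = 0.5*(x - c_i)^2
alpha = 0.1

def run(tau, iters=400):
    x = np.zeros((3, 1))
    for _ in range(iters):
        for _ in range(tau):         # tau rounds of neighbor communication
            x = Pi @ x
        x = x - alpha * (x - c)      # gradient step at the mixed iterate
    return x

spread_1 = np.ptp(run(tau=1))        # disagreement after a single consensus step
spread_5 = np.ptp(run(tau=5))        # disagreement after five consensus steps
assert spread_5 < spread_1           # more communication => tighter consensus
```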
A different and more direct approach to control the tradeoff between consensus and the gradient update is the following:
(6) 
where the weighting is set by a user-defined tuning parameter. We call this algorithm generalized consensus-based distributed SGD (gCDSGD), and the full procedure is detailed in Algorithm 2.
1: Initialization: , , , , ,
2: Distribute the training data set to agents
3: for each agent do
4: Randomly shuffle each data subset
5: for do
6:
7: for do
8:
9: end for
10: for do
11: while do
12: // Incremental Consensus
13:
14: end while
15: end for
16:
17:
18: end for
19: end for
1: Initialization: , , , ,
2: Distribute the training data set to agents
3: for each agent do
4: Randomly shuffle each data subset
5: for do
6:
7: // Generalized Consensus
8: end for
9: end for
By examining Eq. 6, we observe that when the tuning parameter approaches 0, the update law reduces to a pure consensus protocol, and that when it approaches 1, the method reduces to standard stochastic gradient descent (for individual agents).
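These two limiting behaviors can be checked numerically. The sketch below combines a consensus step and a local SGD step through a tuning parameter `omega`; the exact combination form and all names are illustrative assumptions consistent with the description above (small `omega` gives near-pure consensus, large `omega` gives near-plain SGD):

```python
import numpy as np

Pi = np.full((3, 3), 1.0 / 3)        # fully connected interaction matrix
c = np.array([[0.0], [3.0], [6.0]])  # per-agent targets, f_i = 0.5*(x - c_i)^2
alpha = 0.1

def run(omega, iters=500):
    x = np.array([[0.0], [3.0], [6.0]])  # start at individual minimizers
    for _ in range(iters):
        grad = x - c
        # convex combination of a consensus step and a local SGD step
        x = (1 - omega) * (Pi @ x) + omega * (x - alpha * grad)
    return x

# Small omega forces agreement; large omega lets agents keep their own optima.
assert np.ptp(run(0.05)) < np.ptp(run(0.95))
```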
Next, we introduce Nesterov momentum variants of the aforementioned algorithms. A momentum term is typically used to speed up the convergence rate, with a momentum constant close to 1 [26]. For ease of reference, we embed the momentum variants of iCDSGD and gCDSGD within Algorithms 1 and 2; more details can be found in Algorithms 3 and 4 in Supplementary Section A.1.
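As background for why momentum helps, the sketch below contrasts plain gradient descent with Nesterov's method on an ill-conditioned quadratic. It illustrates the generic acceleration effect [26], not the paper's exact update laws (which appear in the supplement); all values are illustrative:

```python
import numpy as np

A = np.diag([1.0, 100.0])            # ill-conditioned quadratic 0.5*x^T A x
alpha, mu = 0.009, 0.9               # step size and momentum constant

def gd(iters=200):
    x = np.array([1.0, 1.0])
    for _ in range(iters):
        x = x - alpha * (A @ x)
    return np.linalg.norm(x)

def nesterov(iters=200):
    x, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(iters):
        look = x + mu * v            # look-ahead point
        v = mu * v - alpha * (A @ look)
        x = x + v
    return np.linalg.norm(x)

assert nesterov() < gd()             # momentum reaches a smaller residual
```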
3.2 Tools for convergence analysis
We now analyze the convergence of the iterates generated by our algorithms. Specifically, we identify an appropriate Lyapunov function (that is bounded from below) for each algorithm that decreases with each iteration, thereby establishing convergence. In our analysis, we use the concatenated (Kronecker) form of the updates. For simplicity, let .
We begin the analysis for gCDSGD by constructing a Lyapunov function that combines the true objective function with a regularization term involving a quadratic form of consensus as follows:
(7) 
It is easy to show that is smooth, and that is smooth with
Likewise, it is easy to show that is strongly convex; therefore is strongly convex with
We also assume that there exists a lower bound for the function value sequence . When the objective functions are strongly convex, we have , where is the optimizer.
Due to Assumptions 1 and 2, it is straightforward to obtain an equivalence between the gradient of Eq. 7 and the update law of gCDSGD. Rewriting (6), we get:
(8) 
Therefore, we obtain:
(9)  
The last term in (9) is precisely the gradient of . In the stochastic setting, can be approximated by sampling one data point (or a minibatch of data points) and the stochastic Lyapunov gradient is denoted by .
Similarly, the update laws for our proposed Nesterov momentum variants can be compactly analyzed using the above Lyapunov function. First, we rewrite the updates for gCDMSGD as follows:
(10a)  
(10b) 
With a few algebraic manipulations, we get:
(11)  
The above derivation simplifies the Nesterov momentum-based updates into a regular form that is more convenient for convergence analysis. For clarity, we separate this into two subequations. Let . Thus, the updates for gCDMSGD can be expressed as
(12a)  
(12b) 
A similar transformation for iCDMSGD is given in Supplementary Section A.1.
For the analysis, we require a bound on the variance of the stochastic Lyapunov gradient, so that the variance of the gradient noise can be bounded from above. (As our proposed algorithms are distributed variants of SGD, this noise is caused by random sampling [27].) The variance of the stochastic Lyapunov gradient is defined as:
The following assumption is standard in SGD convergence analysis, and is based on [28].
Assumption 3.
a) There exist scalars such that and for all ; b) There exist scalars and such that for all .
Remark 1.
While Assumption 3(a) guarantees sufficient descent of in the direction of , Assumption 3(b) states that the variance of is bounded above by the second moment of . The constant can be regarded as representing the second moment of the noise in the gradient . Therefore, the second moment of can be bounded above as , where .
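The role of the minibatch size in this variance bound can be illustrated empirically: the variance of a minibatch gradient estimate shrinks roughly in proportion to 1/b in the minibatch size b. The scalar least-squares setup and all names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal(10_000)
x = 0.5                                   # fixed parameter value

def minibatch_grad(b):
    # gradient of 0.5 * mean((x - d_j)^2) over a random minibatch of size b
    batch = rng.choice(data, size=b, replace=False)
    return x - batch.mean()

var_small = np.var([minibatch_grad(10) for _ in range(2000)])
var_large = np.var([minibatch_grad(100) for _ in range(2000)])
assert var_large < var_small              # larger minibatch, smaller variance
```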
For convergence analysis, we assume:
Assumption 4.
There exists a constant such that .
Since the Lyapunov function is a composite of the true cost function, which is Lipschitz continuous, and the regularization term associated with consensus, it follows immediately that is bounded above by some positive constant.
Before turning to our main results, we present two auxiliary technical lemmas.
Lemma 1.
Lemma 2.
4 Analysis and Main Results
This section presents our main results on the convergence properties of the proposed algorithms. The results are grouped as follows: (i) we provide a rigorous convergence analysis for gCDSGD and iCDSGD, for both strongly convex and nonconvex objective functions; (ii) we analyze their momentum variants for strongly convex objective functions only. All proofs are provided in Supplementary Section A.1.
4.1 Convergence Analysis for iCDSGD and gCDSGD
Our analysis will consist of two components: establishing an upper bound on how far away the estimates of the individual agents are with respect to their empirical mean (which we call the consensus bound), and establishing an upper bound on how far away the overall procedure is with respect to the optimum (which we call the optimality bound).
First, we obtain consensus bounds for the gCDSGD and iCDSGD as follows.
Proposition 1.
Proposition 2.
We provide a discussion on comparing the consensus bounds in the Supplementary Section A.2. Next, we obtain optimality bounds for gCDSGD and iCDSGD.
Theorem 1.
Theorem 2.
Although we show convergence for strongly convex objectives, we note that objective functions are highly nonconvex in most deep learning applications. While convergence to a global minimum in such cases is extremely difficult to establish, we prove that gCDSGD and iCDSGD still exhibit weaker (but meaningful) notions of convergence.
Theorem 3.
Theorem 4.
Remark 2.
Let us discuss the rates of convergence suggested by Theorems 1 and 3. We observe that when the objective function is strongly convex, the function value sequence can linearly converge to within a fixed radius of convergence, which can be calculated as follows:
When the objective function is nonconvex, we cannot claim linear convergence. However, Theorem 3 asserts that the average of the second moment of the Lyapunov gradient is bounded from above. Recall that the parameter bounds the variance of the “noise” due to the stochasticity of the gradient, and if , Theorem 3 implies that asymptotically converges to a first-order stationary point.
Remark 3.
For gCDSGD, let us investigate the corner cases where or . For the strongly convex case, when , we have , where is the condition number. This suggests that if consensus is not a concern, then each iterate converges to its own respective optimum, as depicted in Fig. 1. On the other hand, when , the upper bound converges to . In this scenario, each agent communicates sufficient information with the other agents to arrive at an agreement, and the upper bound depends on the topology of the communication network. If , this results in:
For the nonconvex case, when , the upper bound suggested by Theorem 3 is , while leads to , which is roughly if .
We also compare iCDSGD and CDSGD with gCDSGD in terms of the optimality upper bounds to arrive at a suitable lower bound for . Due to space limitations, this analysis is presented in Supplementary Section A.2.
4.2 Convergence Analysis for momentum variants
We next provide a convergence analysis for the gCDMSGD algorithm, summarized in the update laws given in Eq. 12. A similar analysis can be applied to iCDMSGD. Before stating the main result, we define the sequence as:
(22)  
where represents the average of the objective function values of a minibatch. We define as follows
We now state our main result, which characterizes the performance of gCDMSGD. To our knowledge, this is the first theoretical result for momentumbased versions of consensusdistributed SGD.
Theorem 5.
Theorem 6.
Note that although the theorem statements look identical for gCDMSGD and iCDMSGD, the constants involved differ significantly. Theorem 5 suggests that with a sufficiently small step size, using Nesterov acceleration results in linear convergence (with parameter ) up to a neighborhood of of radius . When , the first term on the right-hand side vanishes and, substituting into , we have
which implies that the upper bound is related to the spectral gap of the network; hence, a conclusion similar to Theorem 1 can be deduced. When , the upper bound becomes . However, leads to . These two scenarios demonstrate that the “gradient noise” caused by stochastic sampling negatively affects convergence. One can use to trade off consensus and gradient updates.
Next, we discuss the upper bounds obtained when for gCDSGD and gCDMSGD. (1) : When is sufficiently small and , it can be observed that the optimality bound for the Nesterov momentum variant is smaller than that for gCDSGD as ; (2) : When and are carefully selected such that , we have when . Therefore, introducing the momentum can speed up the convergence rate with appropriately chosen hyperparameters.
5 Experimental Results
We validate our algorithms via several experiments using the CIFAR-10 image recognition dataset (with standard training and testing sets). The model adopted for the experiments is a deep convolutional neural network (CNN) (with ReLU activations) comprising 2 convolutional layers with 32 filters each followed by a max pooling layer, then 2 more convolutional layers with 64 filters each followed by another max pooling layer, and a dense layer with 512 units. The minibatch size is set to 512, and the step size is set to 0.01 in all experiments. All experiments were performed using Keras with TensorFlow [29, 30]. We use a sparse network topology with 5 agents. We use both balanced and unbalanced data sets for our experiments. In the balanced case, agents have an equal share of the entire training set. In the unbalanced case, agents have (randomly selected) unequal portions of the training set, while we ensure that each agent has at least half as many examples as an equal share would provide. We summarize our key experimental results in this section; more details and results are provided in Supplementary Section A.4.
Performance of algorithms. In Figure 1(a), we compare the performance of the momentum variants of our proposed algorithms, iCDMSGD and gCDMSGD, with state-of-the-art techniques such as CDMSGD and Federated Averaging, using an unbalanced data set. All algorithms were run for 3000 epochs. Observing the average accuracy over all the agents for both training and test data, we note that iCDMSGD converges as fast as CDMSGD with less fluctuation in performance across epochs. While slower to converge, gCDMSGD achieves similar performance (on test data) with less fluctuation as well as a smaller generalization gap (i.e., the difference between training and testing accuracy). All algorithms significantly outperform Federated Averaging in terms of average accuracy. We also vary the tuning parameter for gCDMSGD to show (in Figure 1(b)) that it can achieve a similar (or better) convergence rate than CDMSGD using higher parameter values, with some sacrifice in terms of the generalization gap.
Degree of Consensus. One of the main contributions of our paper is to show that one can control the degree of consensus while maintaining average accuracy in distributed deep learning. We demonstrate this by observing the accuracy difference between the best- and worst-performing agents (identified by computing the mean accuracy over the last 100 epochs). As shown in Figure 1(c), the degree of consensus is similar for all three algorithms on the balanced data set, with iCDMSGD performing slightly better than the rest. However, on the unbalanced data set, both iCDMSGD and gCDMSGD perform significantly better than CDMSGD. Note that the degree of consensus can be further improved for gCDMSGD using lower values of the tuning parameter, as shown in Figure 1(d); however, convergence then becomes relatively slower, as shown in Figure 1(b). We do not compare these results with the Federated Averaging algorithm, as it performs brute-force consensus at every epoch using a centralized parameter server. We also do not vary the number of consensus steps, as the doubly stochastic agent interaction matrix for this small agent population becomes stationary very quickly even for small values; we will explore this in future work with significantly bigger networks.
6 Conclusion and Future Work
To investigate the tradeoff between consensus and optimality in distributed deep learning over a constrained communication topology, this paper presents two new algorithms, called iCDSGD and gCDSGD, along with their momentum variants. We show the convergence properties of the proposed algorithms and the relationships between the hyperparameters and the consensus and optimality bounds. Theoretical and experimental comparisons with the state-of-the-art algorithm CDSGD show that iCDSGD and gCDSGD can improve the degree of consensus among the agents while maintaining the average accuracy, especially when the data are unbalanced among the agents. Future research directions include learning with nonuniform data distributions among agents and over time-varying networks.
References
[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[2] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[3] Peter H Jin, Qiaochu Yuan, Forrest Iandola, and Kurt Keutzer. How to scale distributed deep learning? arXiv preprint arXiv:1611.04581, 2016.
[4] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
[5] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. Internet of things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 29(7):1645–1660, 2013.
[6] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar. An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices. In Proceedings of the 2015 International Workshop on Internet of Things towards Applications, pages 7–12. ACM, 2015.
[7] Nicholas D Lane and Petko Georgiev. Can deep learning revolutionize mobile sensing? In Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications, pages 117–122. ACM, 2015.
[8] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
[9] Zhanhong Jiang, Aditya Balu, Chinmay Hegde, and Soumik Sarkar. Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems (NIPS), 2017.
[10] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014.
[11] Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, and Pradeep Dubey. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709, 2016.
[12] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5336–5346, 2017.
[13] Albert S Berahas, Raghu Bollapragada, Nitish Shirish Keskar, and Ermin Wei. Balancing communication and computation in distributed optimization. arXiv preprint arXiv:1709.02999, 2017.
[14] Zhanhong Jiang, Kushal Mukherjee, and Soumik Sarkar. Generalised gossip-based subgradient method for distributed optimisation. International Journal of Control, pages 1–17, 2017.
[15] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
[16] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
[17] Guannan Qu and Na Li. Accelerated distributed Nesterov gradient descent. arXiv preprint arXiv:1705.07176, 2017.
[18] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685–693, 2015.
[19] Michael Blot, David Picard, Matthieu Cord, and Nicolas Thome. Gossip training for deep learning. arXiv preprint arXiv:1611.09726, 2016.
[20] Zheng Xu, Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, and Tom Goldstein. Adaptive consensus ADMM for distributed optimization. In International Conference on Machine Learning, pages 3841–3850, 2017.
[21] Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhi-Ming Ma, and Tie-Yan Liu. Asynchronous stochastic gradient descent with delay compensation for distributed deep learning. In International Conference on Machine Learning, pages 4120–4129, 2017.
[22] Wenpeng Zhang, Peilin Zhao, Wenwu Zhu, Steven CH Hoi, and Tong Zhang. Projection-free distributed online learning in networks. In International Conference on Machine Learning, pages 4054–4062, 2017.
[23] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. arXiv preprint arXiv:1710.09854, 2017.
[24] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1508–1518, 2017.
[25] Konstantinos I Tsianos and Michael G Rabbat. Distributed strongly convex optimization. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 593–600. IEEE, 2012.
[26] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
[27] Shuang Song, Kamalika Chaudhuri, and Anand Sarwate. Learning from data with heterogeneous noise using SGD. Pages 894–902, 2015.
[28] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
[29] François Chollet et al. Keras, 2015.
[30] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[31] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[32] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574–1582, 2014.
Appendix A Supplementary Materials
A.1 Omitted algorithms
Update rules for the momentum variant, iCDMSGD. The compact form of iCDMSGD is expressed as follows:
(25a)  
(25b) 
Rewriting the above equations yields:
(26)  
Letting , we have
(27a)  
(27b) 
A.1.1 Proofs of main lemmas
We repeat the statements of all lemmas and theorems for completeness.
Lemma 1: Let Assumptions 1 and 2 hold. Then the iterates of gCDSGD (Algorithm 2) satisfy the following:
(28)  
Proof.
By Assumption 1, the iterates generated by gCDSGD satisfy:
(29)  
Taking expectations on both sides, we can obtain
(30)  
While is deterministic, can be considered stochastic due to random sampling. Therefore, we have
(31)  
which completes the proof. ∎