Fast and Differentially Private Algorithms for
Decentralized Collaborative Machine Learning
Abstract
Consider a set of agents in a peer-to-peer communication network, where each agent has a personal dataset and a personal learning objective. The main question addressed in this paper is: how can agents collaborate to improve upon their locally learned model without leaking sensitive information about their data? Our first contribution is to reformulate this problem so that it can be solved by a block coordinate descent algorithm. We obtain an efficient and fully decentralized protocol working in an asynchronous fashion. Our second contribution is to make our algorithm differentially private to protect against the disclosure of any information about personal datasets. We prove convergence rates and exhibit the trade-off between utility and privacy. Our experiments show that our approach dramatically outperforms previous work in the non-private case, and that under privacy constraints we significantly improve over purely local models.
1 Introduction
Connected personal devices are now widespread: they can collect and process increasingly large and sensitive user data. For instance, a smartphone is able to log the webpages that its owner visited but also the physical locations that he/she went to, how much he/she walks in a day, etc.; smart home devices can record voice commands, room temperature, energy consumption, and so on. While this information can be leveraged through machine learning to provide useful personalized services to the user, it also raises serious privacy concerns. Indeed, a common practice is to centralize data from all users on an external server for batch processing, sometimes without explicit consent from users and with little oversight. On the other hand, if the data is considered too sensitive to be shared (due to legislation or because the user opts out), then one has to learn on each device separately, without taking advantage of the multiplicity of data sources (e.g., information from similar users). This approach respects privacy but leads to poor accuracy, in particular for new or moderately active users who have not yet collected much data.
Ideally, users (agents) should be able to collaborate to learn more accurate models while ensuring that their data stay on their local device and that the algorithm does not leak sensitive information to others. Specifically, we are interested in the fully decentralized setting: agents operate asynchronously and communicate over a network in a peer-to-peer fashion, without any central entity to aggregate results or even to coordinate the protocol. Such a decentralized architecture can scale to large sets of users, and intrinsically provides additional security guarantees as it is more difficult for a malicious party to systematically collect all the information transmitted over the network. Decentralized collaborative learning has been recently considered in [24], but this work did not consider any privacy constraints. In fact, while there has been a large body of work on privacy-preserving machine learning from centralized data, notably based on differential privacy (see [9, 4, 2] and references therein), the case where sensitive datasets are distributed across multiple data owners has been much less studied, let alone the fully decentralized setting. Existing approaches for the distributed case [19, 12, 23, 20, 22] require a central (sometimes trusted) server, assume the local data distribution is the same for all users and/or are designed to learn a single global model rather than a personal model for each user.
In this paper, we ask a challenging question: given decentralization and privacy constraints, can agents improve upon their purely local models through collaboration? Our contributions towards a positive answer to this question are threefold. First, we propose a decentralized and asynchronous block coordinate descent algorithm for collaborative learning. Taking advantage of the structure of the problem, this algorithm accommodates general loss functions, with simple updates and provable convergence rates. Second, we design a differentially private scheme based on randomly perturbing each update of our algorithm. This scheme guarantees that the messages sent by the users over the network during the execution of the algorithm do not reveal significant information about any data point of any local dataset. We formally analyze the utility loss due to privacy, with interesting implications on the optimal way to scale the noise across iterations. Third, we conduct experiments to validate our approach. The empirical results show that the trade-off between utility and privacy is in line with our theoretical findings, and that under strong privacy constraints we can still outperform the purely local models in terms of accuracy.
The rest of the paper is organized as follows. Section 2 introduces the problem setting, the notion of differential privacy, and discusses relevant work. In Section 3, we present our decentralized block coordinate descent algorithm and its convergence guarantees. Section 4 introduces a differentially private version of our algorithm and studies the utility loss. Finally, Section 5 is dedicated to numerical experiments. Detailed proofs can be found in the supplementary material.
2 Preliminaries and Background
We start by describing the decentralized collaborative learning framework that we consider in this paper, and briefly present existing results. We then review the notion of differential privacy and go over some relevant work in this area.
2.1 Decentralized Collaborative Learning of Personal Models
Problem setting. Consider a set of $n$ agents. Each agent $i$ has a personal data distribution $\mu_i$ over a space $\mathcal{X}$ and has access to a set $S_i = \{x_i^j\}_{j=1}^{m_i}$ of $m_i$ training examples drawn i.i.d. from $\mu_i$. The goal of agent $i$ is to learn a model $\theta_i$ with small expected loss $\mathbb{E}_{x \sim \mu_i}[\ell(\theta_i; x)]$, where the loss function $\ell(\theta; x)$ is convex in $\theta$ and measures the performance of $\theta$ on data point $x$. In a setting where agent $i$ is not aware of the existence of other users and acts on its own, the standard approach is to learn a model $\theta_i^{loc}$ by minimizing its (potentially regularized) empirical loss:

$$\theta_i^{loc} \in \arg\min_{\theta} \mathcal{L}_i(\theta), \qquad \mathcal{L}_i(\theta) = \frac{1}{m_i} \sum_{j=1}^{m_i} \ell(\theta; x_i^j). \tag{1}$$
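As a concrete illustration, the purely local baseline can be sketched in a few lines; the logistic loss, the L2 regularizer and all constants below are illustrative choices, not prescribed by the paper:

```python
import math
import random

# Minimal sketch of the purely local baseline: an agent fits a model to its
# own dataset by minimizing an L2-regularized empirical logistic loss with
# plain gradient descent. All names and constants are illustrative.

def empirical_loss(theta, data, lam=0.01):
    total = 0.0
    for x, y in data:
        margin = y * sum(t * xk for t, xk in zip(theta, x))
        total += math.log(1.0 + math.exp(-margin))
    return total / len(data) + lam * sum(t * t for t in theta)

def local_fit(data, lam=0.01, lr=0.5, steps=300):
    dim = len(data[0][0])
    theta = [0.0] * dim
    for _ in range(steps):
        grad = [2.0 * lam * t for t in theta]  # gradient of the L2 term
        for x, y in data:
            margin = y * sum(t * xk for t, xk in zip(theta, x))
            coef = -y / ((1.0 + math.exp(margin)) * len(data))
            for k in range(dim):
                grad[k] += coef * x[k]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# toy dataset: labels given by a linear separator
rng = random.Random(0)
data = []
for _ in range(40):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    data.append((x, 1.0 if x[0] - x[1] > 0 else -1.0))
theta_loc = local_fit(data)
```

With little local data this estimate can be poor, which is precisely what collaboration is meant to correct.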
In this paper, agents do not learn in isolation but rather participate in a decentralized peer-to-peer network over which they can exchange information. Such collaboration gives them the opportunity to learn a better model than (1), for instance by allowing some agents to compensate for their lack of data. Formally, let $G = (V, E, W)$ be a weighted connected graph over the set $V = \{1, \dots, n\}$ of agents, where $E$ is the set of edges and $W \in \mathbb{R}^{n \times n}$ is a nonnegative weight matrix. $W_{ij}$ gives the weight of edge $(i, j) \in E$, with the convention that $W_{ij} = 0$ if $(i, j) \notin E$ or $i = j$. Following previous work (see e.g., [11, 24]), we assume that the edge weights reflect a notion of "task relatedness": the weight $W_{ij}$ between agents $i$ and $j$ tends to be large if the models minimizing their respective expected loss are similar. In order to scale to large networks, our goal is to design fully decentralized algorithms: each agent only communicates with its neighborhood, without global knowledge of the network, and can proceed without synchronizing with other agents across the network. Overall, the problem can thus be seen as a multi-task learning problem over a large number of tasks (agents) with imbalanced training sets, which must be solved in a fully decentralized way.
Relevant work. Most of the work in decentralized learning and optimization has focused on the distributed consensus problem, where the goal is to find a single global model which minimizes the sum of the local loss functions (see e.g., [18, 21, 6, 25, 5]). For decentralized learning of personal models, [24] considered a general objective function which trades off between models with small empirical local loss and models that are smooth within neighborhoods (see Eq. 3 in Section 3). At the cost of introducing many auxiliary variables, they are able to cast the objective as a partial consensus problem over the network, which can be solved using a decentralized gossip ADMM algorithm [26]. It involves minimizing a perturbed version of the local loss of two neighboring agents at each iteration and has an $O(1/k)$ convergence rate. Privacy constraints were not considered in this work.
2.2 Differential Privacy
Differential Privacy (DP) [7] has emerged as a powerful measure of how much information about any individual entry of a dataset is contained in the output of an algorithm. Formally, let $\mathcal{M}$ be a randomized mechanism taking a dataset as input, and let $\epsilon > 0$, $\delta \geq 0$. We say that $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private if for all datasets $D, D'$ differing in a single data point and for all sets of possible outputs $\mathcal{O}$, we have:

$$\Pr[\mathcal{M}(D) \in \mathcal{O}] \leq e^{\epsilon} \Pr[\mathcal{M}(D') \in \mathcal{O}] + \delta, \tag{2}$$
where the probability is over the randomness of the mechanism. At a high level, one can see (2) as ensuring that the mechanism's output does not leak much information about any individual data point of the input dataset. DP has many attractive properties: in particular, it provides strong robustness against background knowledge attacks and does not rely on computational assumptions. The composition of several DP mechanisms remains DP, albeit with a graceful degradation in the parameters (see [10, 14] for strong composition results). We refer to [9] for more details on DP.
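To make the definition concrete, here is a minimal sketch of the classic Laplace mechanism, the standard way to achieve a guarantee of the form (2) for a numeric query; the function names and the clipping-based sensitivity bound are illustrative:

```python
import math
import random

# The textbook Laplace mechanism: to release a statistic f(D) with
# epsilon-DP, add centered Laplace noise of scale sensitivity(f) / epsilon,
# where the sensitivity bounds how much f can change when one data point
# of D changes. Illustrative sketch only.

def laplace_sample(scale, rng):
    # inverse-CDF sampling of a centered Laplace random variable
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0.0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_mean(values, lo, hi, eps, rng):
    clipped = [min(max(v, lo), hi) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # replacing one clipped point moves the mean by at most (hi - lo) / n
    sensitivity = (hi - lo) / len(clipped)
    return true_mean + laplace_sample(sensitivity / eps, rng)

rng = random.Random(42)
released = private_mean([0.2, 0.4, 0.6], 0.0, 1.0, eps=0.5, rng=rng)
```

Smaller `eps` means stronger privacy but a noisier release; the same tension drives the utility/privacy trade-off studied in Section 4.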
Relevant work. DP has been mostly considered in the context where a “trusted curator” has access to all data. Existing DP schemes for machine learning in this setting typically rely on the addition of appropriately scaled noise to the learned model (output perturbation) or to the objective function itself (objective perturbation), see for instance [4]. The private multiparty setting, in which sensitive datasets are distributed across multiple data owners, is known to be harder [16] and has been less studied in spite of its relevance for many applications. Local DP [6, 15], consisting in locally perturbing the data points themselves before releasing them, often achieves poor accuracy (especially when local datasets are small). An alternative strategy is to rely on DP aggregation of models locally trained by each party [19, 12]. DP schemes for (stochastic) gradient descent in the distributed setting have also been proposed, based on perturbing the gradients, the iterates and/or the objective [23, 20, 13, 22]. Apart from local DP, the above methods do not apply to our setting for a combination of reasons. In particular, the local data distribution is different for each party and we learn a personal model for each agent instead of a single global model. Last but not least, we seek an asynchronous and fully decentralized algorithm without any master node to perform aggregation or coordinate the protocol. We are not aware of any previous DP machine learning schemes designed for this setting.
3 Decentralized Collaborative Learning with Block Coordinate Descent
We start by introducing some convenient notation. For any $\Theta \in \mathbb{R}^{np}$ and $i \in \{1, \dots, n\}$, we will denote by $[\Theta]_i \in \mathbb{R}^p$ its $i$-th block of size $p$. We also define the matrices $U_i \in \{0, 1\}^{np \times p}$, $i \in \{1, \dots, n\}$, such that $[U_i \theta]_i = \theta$ for any $\theta \in \mathbb{R}^p$ (all other blocks being zero). We thus have $\Theta = \sum_{i=1}^{n} U_i [\Theta]_i$ for any $\Theta \in \mathbb{R}^{np}$.
3.1 Objective Function
Our goal in collaborative learning is to jointly learn the models of the agents by leveraging both their local datasets and the similarity information embedded in the network graph. We rely on the principle of graph regularization used in [11, 24] to favor models that vary smoothly on the graph. Specifically, representing the set of all models as a stacked vector $\Theta = [\theta_1; \dots; \theta_n] \in \mathbb{R}^{np}$, the objective function $Q(\Theta)$ we wish to minimize is defined as follows:
$$Q(\Theta) = \frac{1}{2} \sum_{i < j} W_{ij} \|\theta_i - \theta_j\|^2 + \mu \sum_{i=1}^{n} D_{ii} c_i \mathcal{L}_i(\theta_i), \tag{3}$$

where $\mu > 0$ is a trade-off parameter, $D_{ii} = \sum_{j=1}^{n} W_{ij}$ is a normalization factor and $c_i \in (0, 1]$ is the "confidence" of agent $i$ (in practice, we will set $c_i \propto m_i$, plus some small constant when $m_i = 0$). Minimizing (3) implements a trade-off between having similar models for strongly connected agents and models that are accurate on their respective local datasets (the higher the confidence of an agent, the more importance given to the latter part). This allows agents to leverage relevant information from their neighbors. It is particularly salient for agents with little data, which can gain useful knowledge from better-endowed neighbors without "polluting" them with their own inaccurate model.
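The trade-off described above can be sketched as follows; the precise normalization and confidence weighting used below are assumptions for illustration, not necessarily the paper's exact formula:

```python
# A hedged sketch of a graph-regularized objective in the spirit of (3):
# a smoothness term coupling the models of strongly connected agents plus
# confidence-weighted local empirical losses.

def objective(models, W, conf, local_losses, mu):
    n = len(models)
    smoothness = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            diff_sq = sum((a - b) ** 2 for a, b in zip(models[i], models[j]))
            smoothness += W[i][j] * diff_sq
    local = sum(conf[i] * local_losses[i](models[i]) for i in range(n))
    return 0.5 * smoothness + mu * local

# two agents, one edge of weight 1: smoothness part is 0.5 * (1 - 0)^2
W = [[0.0, 1.0], [1.0, 0.0]]
models = [[1.0], [0.0]]
losses = [lambda th: th[0] ** 2, lambda th: th[0] ** 2]
q = objective(models, W, [1.0, 1.0], losses, mu=1.0)  # 0.5 + 1.0 = 1.5
```

Increasing `mu` pulls each model toward its own data; decreasing it pulls neighboring models together.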
We now discuss a few assumptions and properties of $Q$. We assume that for any $i$, the local loss function $\mathcal{L}_i$ of agent $i$ is convex in its first argument with $C_i$-Lipschitz continuous gradient. This implies that $Q$ is convex in $\Theta$ (the first term in (3) is a Laplacian quadratic form, hence convex in $\Theta$). If we further assume that each local loss $\mathcal{L}_i$ is $\sigma_i$-strongly convex in its first argument with $\sigma_i > 0$ (this is the case for instance when the local loss is L2-regularized), then $Q$ is $\sigma$-strongly convex with $\sigma \geq \mu \min_i D_{ii} c_i \sigma_i$. In other words, for any $\Theta, \Theta' \in \mathbb{R}^{np}$ we have $Q(\Theta') \geq Q(\Theta) + \nabla Q(\Theta)^\top (\Theta' - \Theta) + \frac{\sigma}{2} \|\Theta' - \Theta\|^2$. The partial derivative of $Q$ corresponding to the variables in $\theta_i$ is given by

$$[\nabla Q(\Theta)]_i = \sum_{j \in N_i} W_{ij} (\theta_i - \theta_j) + \mu D_{ii} c_i \nabla \mathcal{L}_i(\theta_i), \tag{4}$$

where $N_i = \{j : W_{ij} > 0\}$ denotes the neighborhood of agent $i$. For $i \in \{1, \dots, n\}$, the $i$-th block Lipschitz constant $L_i$ of $\nabla Q$ satisfies $\|[\nabla Q(\Theta)]_i - [\nabla Q(\Theta')]_i\| \leq L_i \|\theta_i - \theta_i'\|$ for any $\Theta, \Theta' \in \mathbb{R}^{np}$ differing only in block $i$. It is easy to see that $L_i = D_{ii} (1 + \mu c_i C_i)$.
3.2 Proposed Algorithm
Our goal is to minimize the objective function (3) in a fully decentralized manner. Specifically, we operate in the asynchronous execution model [3, 1, 17]: each agent has a local clock ticking at the times of a rate 1 Poisson process, and wakes up when it ticks without waiting for other agents. As local clocks are i.i.d., we can equivalently consider a single clock which ticks when one of the local clocks ticks. This provides a more convenient way to state and analyze the algorithms in terms of a global clock counter. For communication, we rely on a broadcast-based model [1, 17] where agents communicate by sending messages to all their neighbors at once (without expecting a reply). This is in contrast to gossip-based algorithms, which rely on bidirectional communication between pairs of agents. The broadcast-based model is very appealing in wireless distributed systems, since sending a message to all neighbors has the same cost as sending to a single neighbor.
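The asynchronous execution model can be simulated directly; the sketch below samples Exp(1) inter-tick times per agent and replays the resulting global wake-up order (all names are illustrative):

```python
import random

# Simulating the asynchronous model: each agent's local clock ticks
# according to a rate-1 Poisson process, i.e., with Exp(1) waiting times.
# Equivalently, a single global clock ticks at rate n and the waking agent
# at each global tick is uniform at random.

def simulate_wakeups(n_agents, horizon, seed=0):
    rng = random.Random(seed)
    next_tick = [rng.expovariate(1.0) for _ in range(n_agents)]
    order = []
    while True:
        i = min(range(n_agents), key=lambda k: next_tick[k])
        if next_tick[i] > horizon:
            break
        order.append(i)
        next_tick[i] += rng.expovariate(1.0)
    return order

order = simulate_wakeups(n_agents=5, horizon=100.0)
```

Over a horizon of length `h`, each agent wakes up about `h` times, so the global iteration count grows roughly linearly with the number of agents.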
Given the above constraints, we propose a decentralized coordinate descent algorithm to minimize (3). Suppose that at time step $t$, agent $i$ wakes up. Two consecutive actions are performed by $i$:

Update step: agent $i$ updates its local model based on the most recent information $\theta_j(t)$ received from its neighbors $j \in N_i$:

$$\theta_i(t+1) = \frac{1}{1 + \mu c_i C_i} \Big( \mu c_i \big( C_i \theta_i(t) - \nabla \mathcal{L}_i(\theta_i(t)) \big) + \sum_{j \in N_i} \frac{W_{ij}}{D_{ii}} \theta_j(t) \Big), \tag{5}$$

where $N_i = \{j : W_{ij} > 0\}$ denotes the neighborhood of agent $i$.

Broadcast step: agent $i$ sends its updated model $\theta_i(t+1)$ to its neighborhood $N_i$.
All other variables in the network remain unchanged at that iteration.
The update step (5) consists in a block coordinate descent update with respect to $\theta_i$ and only requires agent $i$ to know the models previously broadcast by its neighbors $j \in N_i$. Furthermore, the agent does not need to know the global iteration counter $t$, hence no global clock is needed. The algorithm is thus fully decentralized and asynchronous. Interestingly, notice that this block coordinate descent update is adaptive to the confidence level of each agent in two respects: (i) globally, the more confidence, the more importance given to the gradient of the local loss compared to the neighbors' models, and (ii) locally, when $\theta_i(t)$ is close to a minimizer of the local loss (which is the case for instance if we initialize $\theta_i(0)$ to such a minimizer), agents with low confidence will trust their neighbors' models more aggressively than agents with high confidence (which will make more conservative updates). This second property is in contrast to a (centralized) gradient descent approach, which would use the same constant, more conservative step size (equal to the standard Lipschitz constant of $\nabla Q$) for all agents; it is in line with the intuition that agents with low confidence should diverge more quickly from their local minimizer than those with high confidence.
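A hedged sketch of one such update follows; the mixing of the neighbors' models with a confidence-weighted local gradient step is an illustrative instantiation of the idea, not the paper's exact derived update rule:

```python
# Illustrative sketch of one asynchronous update: the waking agent i mixes
# its neighbors' last-broadcast models with a gradient step on its own
# local loss, weighted by its confidence. The step form and the constant
# alpha are assumptions made for illustration.

def cd_step(i, models, W, conf, mu, grad_local, alpha=0.5):
    deg = sum(W[i])
    dim = len(models[i])
    neigh_avg = [0.0] * dim
    for j, w in enumerate(W[i]):
        if w > 0.0:
            for k in range(dim):
                neigh_avg[k] += (w / deg) * models[j][k]
    g = grad_local(models[i])
    # move toward the neighbors' average, corrected by the
    # confidence-weighted gradient of the local loss
    models[i] = [
        (1.0 - alpha) * models[i][k] + alpha * (neigh_avg[k] - mu * conf[i] * g[k])
        for k in range(dim)
    ]
    return models[i]

# with mu = 0 the update is a plain half-step toward the neighbor average
W = [[0.0, 1.0], [1.0, 0.0]]
models = [[2.0], [0.0]]
new_theta = cd_step(0, models, W, conf=[1.0, 1.0], mu=0.0,
                    grad_local=lambda th: [0.0])
```

Only agent `i`'s block changes, and only the models previously broadcast by its neighbors are read, which is what makes the scheme fully decentralized.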
Under our assumption that the local clocks of the agents are i.i.d., the above algorithm can be seen as a randomized block coordinate descent algorithm [27]. It enjoys a fast linear convergence rate when $Q$ is strongly convex, as shown in the following result.
Proposition 1 (Convergence rate).
For any $T > 0$, let $(\Theta(t))_{t \geq 0}$ be the sequence of iterates generated by the proposed algorithm running for $T$ iterations from an initial point $\Theta(0)$. Let $Q^* = \min_{\Theta} Q(\Theta)$. When $Q$ is $\sigma$-strongly convex, we have:

$$\mathbb{E}\big[Q(\Theta(T))\big] - Q^* \leq \Big(1 - \frac{\rho}{n}\Big)^{T} \big( Q(\Theta(0)) - Q^* \big), \qquad \text{where } \rho = \frac{\sigma}{\max_i L_i}.$$

Proof.
See the supplementary material. ∎
Remark 1.
For general convex $Q$, an $O(1/t)$ convergence rate can be obtained; see [27] for details.
The above result shows that each iteration shrinks the suboptimality gap by a constant factor. While this factor degrades linearly with the number of agents $n$, this is compensated by the fact that the number of iterations done in parallel also scales roughly linearly with $n$ (because agents wake up asynchronously). We thus expect the algorithm to scale gracefully with the size of the network if the number of updates per agent remains constant. The value $\rho$ is the ratio between the lower and upper bounds on the curvature of $Q$. Focusing on the relative differences between agents, and assuming that the constants $C_i$ and $\sigma_i$ are of the same order for all agents, it indicates that the algorithm converges faster when the degree-weighted confidence $D_{ii} c_i$ of the agents is approximately the same. On the other hand, two types of agents can represent a bottleneck for the convergence rate: (i) a high-confidence and high-degree agent (the overall progress is then very dependent on the updates of that agent), and (ii) a low-confidence agent which is also poorly connected (and thus converges slowly).
Remark 2 (Comparison to existing ADMM algorithm).
Our algorithm has several advantages over the decentralized ADMM introduced by [24]. It is much simpler (no auxiliary variables are needed), it achieves a linear convergence rate for strongly convex functions, and each iteration is computationally cheaper (a single local gradient step instead of a full minimization). We will show in Section 5 that our algorithm indeed performs much better in practice.
4 Differentially Private Collaborative Learning
As described above, the algorithm introduced in the previous section has many interesting properties. However, it is not differentially private: while there is no direct exchange of data between agents, the sequence of iterates broadcast by an agent may reveal information about its private dataset through the gradient of the local loss. In this section, we start by defining the privacy model of interest and then introduce an appropriate scheme to make our algorithm private. We study its utility loss and the trade-off between utility and privacy.
4.1 Privacy Model
At a high level, our goal is to prevent eavesdropping attacks. We assume the existence of an adversary who observes all the information sent over the network during the execution of the algorithm, but cannot access the agents' internal memory. We want to ensure that such an adversary cannot learn much information about any individual data point of any agent's dataset. This is a very strong notion of privacy: each agent does not trust any other agent or any third party to process its data, hence the privacy-preserving mechanism must be implemented at the agent level. Furthermore, note that our privacy model protects any agent against all other agents, even if they collude (i.e., share the information they receive). We assume a honest-but-curious model for the agents: they want to learn as much as possible from the information that they receive, but they truthfully follow the protocol.
To formally define this privacy model, we rely on the notion of Differential Privacy introduced in Section 2.2. Following the notations of (2), each agent $i$ runs a mechanism $\mathcal{M}_i(S_i)$ which takes its local dataset $S_i$ and outputs all the information sent by $i$ over the network during the execution of the algorithm (i.e., the sequence of iterates broadcast by the agent). Our goal is to make $\mathcal{M}_i$ DP for all agents $i$ simultaneously. Note that learning purely local models (1) is a perfectly private baseline according to the above definition, as agents do not exchange any information. Below, we present a way to collaboratively learn better models while ensuring privacy.
4.2 PrivacyPreserving Scheme
The privacy-preserving version of our algorithm consists in replacing the update step (5) by the following one (assuming that at time $t$ agent $i$ wakes up):

$$\widetilde{\theta}_i(t+1) = \frac{1}{1 + \mu c_i C_i} \Big( \mu c_i \big( C_i \widetilde{\theta}_i(t) - (\nabla \mathcal{L}_i(\widetilde{\theta}_i(t)) + \eta_i(t)) \big) + \sum_{j \in N_i} \frac{W_{ij}}{D_{ii}} \widetilde{\theta}_j(t) \Big), \tag{6}$$

where $\eta_i(t) \in \mathbb{R}^p$ is a noise vector whose entries are drawn independently from a centered Laplace distribution with finite scale $s_i(t)$ (when $s_i(t) = 0$, we use the convention $\eta_i(t) = 0$ with probability 1). The difference with the non-private update (5) is that agent $i$ adds appropriately scaled Laplace noise to the gradient of its local loss $\mathcal{L}_i$. It then sends the resulting noisy iterate $\widetilde{\theta}_i(t+1)$, instead of the non-private one, to its neighbors.
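The gradient perturbation at the heart of the private update can be sketched as follows; the sensitivity constant and the $1/m_i$ scaling below are assumptions made for illustration:

```python
import math
import random

# Sketch of the gradient perturbation in the private update: the agent adds
# centered Laplace noise to each coordinate of its local gradient before
# using it. The scale shrinks as 1 / m_i, so agents with more data add less
# noise. The sensitivity constant is a labeled assumption.

def noisy_gradient(grad, eps_step, m_i, lip=1.0, rng=random):
    scale = 2.0 * lip / (m_i * eps_step)  # assumed per-step sensitivity / epsilon
    out = []
    for g in grad:
        # inverse-CDF sampling of a centered Laplace variable of that scale
        u = rng.random() - 0.5
        sign = 1.0 if u >= 0.0 else -1.0
        out.append(g - scale * sign * math.log(1.0 - 2.0 * abs(u)))
    return out

rng = random.Random(7)
g_priv = noisy_gradient([0.3, -0.1], eps_step=0.1, m_i=50, rng=rng)
```

Everything else in the update depends only on already-public neighbor models, so perturbing the gradient is enough to protect the local dataset.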
Assume that update (6) is run $T_i$ times by agent $i$ within the total $T$ iterations across the network. Let $\mathcal{T}_i \subseteq \{1, \dots, T\}$ be the set of iterations at which agent $i$ woke up and consider the mechanism $\mathcal{M}_i(S_i)$ outputting the corresponding sequence of noisy iterates. The following theorem gives the scale of the additional noise at each iteration, $s_i(t)$, needed to provide the desired overall differential privacy guarantees.
Theorem 1 (Differential privacy of $\mathcal{M}_i$).
Let $T > 0$ and assume that $\mathcal{L}_i(\theta) = \frac{1}{m_i} \sum_{j=1}^{m_i} \ell(\theta; x_i^j)$, where $\ell(\cdot; x)$ is $L_0$-Lipschitz with respect to the $L_1$-norm for all $x$. For any $t \in \mathcal{T}_i$, let $s_i(t) = \frac{2 L_0}{m_i \epsilon_i(t)}$ for some $\epsilon_i(t) > 0$. For any $\bar{\delta}_i \geq 0$ and initial point $\Theta(0)$ independent of $S_i$, the mechanism $\mathcal{M}_i(S_i)$ is $(\bar{\epsilon}_i, \bar{\delta}_i)$-DP with

$$\bar{\epsilon}_i = \min \Bigg\{ \sum_{t \in \mathcal{T}_i} \epsilon_i(t), \ \sum_{t \in \mathcal{T}_i} \frac{(e^{\epsilon_i(t)} - 1)\, \epsilon_i(t)}{e^{\epsilon_i(t)} + 1} + \sqrt{2 \log(1/\bar{\delta}_i) \sum_{t \in \mathcal{T}_i} \epsilon_i(t)^2} \Bigg\}.$$
Remark 3.
We can obtain a similar result if we assume Lipschitzness of $\ell$ with respect to the $L_2$-norm (instead of $L_1$) and use Gaussian noise (instead of Laplace). Details are given in the supplementary material.
Theorem 1 shows that $\mathcal{M}_i$ is $\bar{\epsilon}_i$-DP for $\bar{\epsilon}_i = \sum_{t \in \mathcal{T}_i} \epsilon_i(t)$ when $\bar{\delta}_i = 0$. One can also achieve a better scaling of $\bar{\epsilon}_i$ with the number of updates $T_i$ at the cost of setting $\bar{\delta}_i > 0$ (see [14] for a discussion of the trade-offs in the composition of DP mechanisms). Further note that the noise scale needed to guarantee DP for an agent is inversely proportional to the size $m_i$ of its local dataset. This is a key property for collaborative learning: agents with more local data (and hence larger confidence and more informative gradients) will add less noise and thus pass useful information to their neighbors. In contrast, agents with small datasets will add more noise, but will only marginally influence other agents due to their low confidence.
The next result quantifies how the added noise affects the convergence.
Theorem 2 (Utility loss).
For any $T > 0$, let $(\widetilde{\Theta}(t))_{t \geq 0}$ be the sequence of iterates generated by $T$ iterations of update (6) from an initial point $\Theta(0)$. For $\sigma$-strongly convex $Q$, we have:

$$\mathbb{E}\big[Q(\widetilde{\Theta}(T))\big] - Q^* \leq \Big(1 - \frac{\rho}{n}\Big)^{T} \big( Q(\Theta(0)) - Q^* \big) + \frac{p}{n} \sum_{t=1}^{T} \Big(1 - \frac{\rho}{n}\Big)^{T-t} \sum_{i=1}^{n} a_i\, s_i(t)^2,$$

where $a_i = \mu^2 c_i^2 D_{ii} / (1 + \mu c_i C_i)$.
Theorem 2 shows that the optimization error of the private algorithm after $T$ iterations decomposes into two terms. The first term is the same as in the non-private setting and decreases with $T$. The second term gives an additive error due to the addition of noise, which takes the form of a weighted sum of the variances of the noise added to the iterates across iterations (note that we indeed recover the non-private convergence rate of Proposition 1 when the noise scale is $0$). When the noise scale used by each agent is constant across iterations, i.e., $s_i(t) = s_i$ for any $i$ and $t$, this additive error is the partial sum of a geometric series which converges to a finite value as $T \to \infty$.
In practical scenarios, each agent $i$ has an overall privacy budget $(\bar{\epsilon}_i, \bar{\delta}_i)$. Assume that the agents agree on a value for the total number of iterations $T$ (e.g., using Proposition 1 to achieve the desired precision). Each agent $i$ is thus expected to wake up $T_i = T/n$ times, and can use Theorem 1 to appropriately distribute its privacy budget across these iterations and stop after $T_i$ updates. While distributing the budget equally across the iterations is a simple and practical strategy, Theorem 2 suggests that better utility can be achieved if the noise scale is adapted across iterations. Assume that agents know in advance the clock schedule for a particular run of the algorithm, i.e., agent $i$ knows the set $\mathcal{T}_i$ of global iterations at which it will wake up. The following result then gives the noise allocation policy minimizing the utility loss.
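The simple equal-splitting strategy mentioned above can be sketched in a few lines (under basic composition, where per-step privacy losses add up):

```python
# Sketch of the simplest budget-splitting strategy under basic composition:
# an agent expecting to perform T_i updates spends eps_total / T_i per
# update, so the per-step losses sum exactly to its overall budget.

def split_budget_equally(eps_total, n_updates):
    per_step = eps_total / n_updates
    return [per_step] * n_updates

schedule = split_budget_equally(eps_total=1.0, n_updates=10)
```

Each per-step epsilon determines the noise scale of that update, and the agent simply stops once its schedule is exhausted.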
Proposition 2.
Let $\bar{\epsilon}_i > 0$ and for any agent $i$ define $\kappa_t = (1 - \rho/n)^{(T-t)/3}$. Assuming $s_i(t) = \frac{2 L_0}{m_i \epsilon_i(t)}$ for $t \in \mathcal{T}_i$ as in Theorem 1, the following privacy parameters optimize the utility loss of Theorem 2 while ensuring that the budget $\bar{\epsilon}_i = \sum_{t \in \mathcal{T}_i} \epsilon_i(t)$ is matched exactly:

$$\epsilon_i(t) = \bar{\epsilon}_i \, \frac{\kappa_t}{\sum_{t' \in \mathcal{T}_i} \kappa_{t'}}, \qquad t \in \mathcal{T}_i.$$
The above noise allocation policy requires the agents to know the schedule in advance, as well as the global iteration counter. This is an unrealistic assumption in the fully decentralized setting, where no global clock is available. Still, Proposition 2 may be useful to design practical heuristic strategies, for instance based on the expected global times at which an agent wakes up at each of its iterations. We leave this for future work.
Remark 4.
Theorem 2 implies that it is beneficial to have a good warm start point $\Theta(0)$; however, $\Theta(0)$ must itself be generated in a differentially private way. In the supplementary material, we describe a strategy to generate such a private warm start based on the propagation of locally perturbed models throughout the network.
5 Numerical Experiments
Task description. To be able to compare our algorithm with the one in [24], we conducted experiments on the collaborative linear classification task introduced by the same authors. We briefly recall the setup. Consider a set of agents. Each of these agents has an underlying target linear separator (unknown to the agent) whose first two entries are drawn from a centered normal distribution and whose remaining entries are set to zero. The weight between two agents $i$ and $j$ is a decreasing function of the angle between their target models (negligible weights are ignored). Each agent receives a random number of training points, and each training point is drawn uniformly around the origin and labeled according to the target model. We then add some label noise, independently flipping each label with a small probability. The loss function used by all agents is the logistic loss (which is Lipschitz), and the L2 regularization parameter of each agent is set so as to ensure the overall strong convexity. The hyperparameter $\mu$ is tuned to maximize the accuracy of the non-private algorithm on a validation set of random problem instances. For each agent, the test accuracy of a model is estimated on a separate sample of test points.
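The synthetic task generation can be sketched as follows; the exact angle-based decay of the weights is an assumption for illustration, since the constants are not reproduced here:

```python
import math
import random

# Sketch of the synthetic collaborative task: each agent gets a target
# linear separator whose first two coordinates are Gaussian (the rest are
# zero), and the graph weight between two agents decays with the angle
# between their target models. The decay function is an assumed form.

def make_targets(n_agents, dim, rng):
    return [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] + [0.0] * (dim - 2)
            for _ in range(n_agents)]

def edge_weight(wi, wj):
    dot = sum(a * b for a, b in zip(wi, wj))
    norm_i = math.sqrt(sum(a * a for a in wi))
    norm_j = math.sqrt(sum(a * a for a in wj))
    cos = max(-1.0, min(1.0, dot / (norm_i * norm_j)))
    angle = math.acos(cos)
    return math.exp(-angle ** 2)  # assumed angle-based decay

rng = random.Random(0)
targets = make_targets(4, 10, rng)
W = [[0.0 if i == j else edge_weight(targets[i], targets[j])
      for j in range(4)] for i in range(4)]
```

Agents with nearly aligned target separators end up strongly connected, which is exactly the "task relatedness" the graph regularizer exploits.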
Non-private setting: CD versus ADMM. We start by comparing our coordinate descent algorithm (5) to the ADMM algorithm proposed by [24] in the non-private setting. Both algorithms are fully decentralized and asynchronous, but recall that our algorithm relies on a broadcast communication model (a node sends information to all its neighbors) while the ADMM algorithm is gossip-based (a node exchanges information with a random neighbor). Which communication model is the most efficient strongly depends on the network infrastructure, but we can meaningfully compare the algorithms by tracking the objective value and the test accuracy with respect to the number of iterations and the number of vectors transmitted along the edges of the network. Both algorithms are initialized using the purely local models, i.e., $\theta_i(0) = \theta_i^{loc}$ for all $i$. Figure 1 shows the results (averaged over 5 runs): our coordinate descent algorithm significantly outperforms ADMM, despite the fact that ADMM makes several local gradient steps at each iteration. We believe that this is mostly due to the fact that the 4 auxiliary variables per edge needed by ADMM to encode smoothness constraints are updated only when that particular edge is activated. In contrast, our CD algorithm does not require any auxiliary variables.
Private setting. We now turn to the privacy-preserving setting. In this experiment, each agent has the same overall privacy budget. It splits its privacy budget equally across its iterations using Theorem 1, and stops updating when its budget is exhausted. We first illustrate empirically the trade-offs implied by Theorem 2: namely, running more iterations per agent reduces the first term of the bound but increases the second term, because more noise is added at each iteration. This behavior is easily seen in Figure 2(a), where $\Theta(0)$ is initialized to a constant vector. In Figure 2(b), we have initialized the algorithm with a private warm start solution (see the supplementary material). The results confirm that, for a modest additional privacy budget, a good warm start point can lead to lower values of the objective function in fewer iterations (as suggested again by Theorem 2). The gain in test accuracy here is significant.
Figure 2(c) shows the results for various dimensions, averaged over several runs. We used the same private warm start strategy as in Figure 2(b), and the number of iterations per node was tuned on a validation set of random problem instances. We see that even under a small privacy budget, the resulting models significantly outperform the purely local models (a perfectly private baseline). In the supplementary material, we display additional results showing that all agents (irrespective of their dataset size) obtain an improvement in test accuracy. This improvement is especially large for users with smaller local datasets, effectively correcting the imbalance in dataset sizes. We also show that perturbing the data itself (local DP [6, 15]) leads to very inaccurate models. These results demonstrate the relevance of our privacy-preserving collaborative learning approach.
6 Conclusion
We introduced and analyzed an efficient algorithm for decentralized collaborative learning under privacy constraints. We believe that this problem is becoming more and more relevant as connected objects become ubiquitous. Further research is needed to address more dynamic scenarios: agents may join or leave during the execution, data may be collected online, etc.
Acknowledgments
This work was partially supported by grant ANR-16-CE23-0016-01, by a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020, and by European ERC Grant 339539 - AOC (Adversary-Oriented Computing).
References
 [1] Tuncer Can Aysal, Mehmet Ercan Yildiz, Anand D. Sarwate, and Anna Scaglione. Broadcast gossip algorithms for consensus. IEEE Transactions on Signal Processing, 57(7):2748–2761, 2009.
 [2] Raef Bassily, Adam D. Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, 2014.
 [3] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking, 14(SI):2508–2530, 2006.
 [4] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
 [5] Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions. In ICML, 2016.
 [6] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Privacy Aware Learning. In NIPS, 2012.
 [7] Cynthia Dwork. Differential Privacy. In ICALP, volume 2, 2006.
 [8] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
 [9] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
 [10] Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and differential privacy. In FOCS, 2010.
 [11] Theodoros Evgeniou and Massimiliano Pontil. Regularized multitask learning. In KDD, 2004.
 [12] Jihun Hamm, Yingjun Cao, and Mikhail Belkin. Learning privately from multiparty data. In ICML, 2016.
 [13] Zhenqi Huang, Sayan Mitra, and Nitin Vaidya. Differentially private distributed optimization. ICDCN, 2015.
 [14] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In ICML, 2015.
 [15] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local differential privacy. Journal of Machine Learning Research, 17:1–51, 2016.
 [16] Andrew McGregor, Ilya Mironov, Toniann Pitassi, Omer Reingold, Kunal Talwar, and Salil Vadhan. The limits of twoparty differential privacy. In FOCS, 2010.
 [17] Angelia Nedic. Asynchronous broadcastbased convex optimization over a network. IEEE Transactions on Automatic Control, 56(6):1337–1351, 2011.
 [18] Angelia Nedic and Asuman E. Ozdaglar. Distributed Subgradient Methods for MultiAgent Optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
 [19] Manas A. Pathak, Shantanu Rane, and Bhiksha Raj. Multiparty differential privacy via aggregation of locally trained classifiers. In NIPS, 2010.
 [20] Arun Rajkumar and Shivani Agarwal. A differentially private stochastic gradient descent algorithm for multiparty classification. In AISTATS, 2012.
 [21] S. Sundhar Ram, Angelia Nedic, and Venugopal V. Veeravalli. Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization. Journal of Optimization Theory and Applications, 147(3):516–545, 2010.
 [22] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In CCS, 2015.
 [23] Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. Stochastic gradient descent with differentially private updates. In GlobalSIP, 2013.
 [24] Paul Vanhaesebrouck, Aurélien Bellet, and Marc Tommasi. Decentralized Collaborative Learning of Personalized Models over Networks. In AISTATS, 2017.
 [25] Ermin Wei and Asuman E. Ozdaglar. Distributed Alternating Direction Method of Multipliers. In CDC, 2012.
 [26] Ermin Wei and Asuman E. Ozdaglar. On the O(1/k) Convergence of Asynchronous Distributed Alternating Direction Method of Multipliers. In GlobalSIP, 2013.
 [27] Stephen J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
Supplementary Material
Appendix A Proofs
A.1 Proof of Theorem 1
We first show that for an agent $i$ and an iteration $t$, the additional noise $\eta_i(t)$ provides differential privacy for the published $\widetilde{\Theta}_i(t+1)$. In the following, two datasets $\mathcal{D}_i$ and $\mathcal{D}_i'$ are called neighbors if they differ in a single data point. We denote this neighboring relation by $\mathcal{D}_i \sim \mathcal{D}_i'$.
We will need the following lemma.
Lemma 1.
For two neighboring datasets $\mathcal{D}_i$ and $\mathcal{D}_i'$ of the same size $m_i$:
$$\big\| \nabla \mathcal{L}_i(\theta; \mathcal{D}_i) - \nabla \mathcal{L}_i(\theta; \mathcal{D}_i') \big\|_1 \le \frac{2 C_0}{m_i},$$
where $\mathcal{L}_i(\theta; \mathcal{D}_i) = \frac{1}{m_i} \sum_{(x,y) \in \mathcal{D}_i} \ell(\theta; x, y)$ is the empirical loss of agent $i$ and $C_0$ is the Lipschitz constant of $\ell$.
Proof.
Assume that instead of the data point $(x, y)$ in $\mathcal{D}_i$, there is $(x', y')$ in $\mathcal{D}_i'$. As $\mathcal{D}_i$ and $\mathcal{D}_i'$ are neighboring datasets, the other data points in $\mathcal{D}_i$ and $\mathcal{D}_i'$ are the same. Hence:
$$\big\| \nabla \mathcal{L}_i(\theta; \mathcal{D}_i) - \nabla \mathcal{L}_i(\theta; \mathcal{D}_i') \big\|_1 = \frac{1}{m_i} \big\| \nabla \ell(\theta; x, y) - \nabla \ell(\theta; x', y') \big\|_1 \le \frac{2 C_0}{m_i},$$
since the $C_0$-Lipschitzness of $\ell(\cdot; x, y)$ (with respect to the $L_1$ norm) for all $(x, y)$ implies that for any $\theta$ and $(x, y)$, we have $\| \nabla \ell(\theta; x, y) \|_1 \le C_0$. ∎
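As a quick numerical illustration of Lemma 1 (not part of the paper's argument), one can check the bound on a toy instance. The absolute loss and the feature normalization below are our own illustrative choices: normalizing each $x$ so that $\|x\|_1 \le C_0$ makes the per-point gradient satisfy $\|\nabla \ell(\theta; x, y)\|_1 \le C_0$, so swapping one point moves the average gradient by at most $2C_0/m$ in $L_1$ norm.

```python
import random

def grad_abs_loss(theta, x, y):
    # gradient of l(theta; x, y) = |<theta, x> - y| w.r.t. theta: sign(residual) * x
    s = 1.0 if sum(t * xi for t, xi in zip(theta, x)) - y >= 0 else -1.0
    return [s * xi for xi in x]

def avg_grad(theta, data):
    # gradient of the empirical loss (1/m) * sum over the dataset
    m = len(data)
    grads = [grad_abs_loss(theta, x, y) for x, y in data]
    return [sum(g[k] for g in grads) / m for k in range(len(theta))]

random.seed(0)
m, p, C0 = 50, 5, 1.0

def sample_point():
    x = [random.uniform(-1.0, 1.0) for _ in range(p)]
    norm = sum(abs(v) for v in x)
    x = [C0 * v / norm for v in x]       # enforce ||x||_1 = C0
    return (x, random.uniform(-1.0, 1.0))

D = [sample_point() for _ in range(m)]
D_neighbor = D[:-1] + [sample_point()]   # differs from D in a single point
theta = [random.uniform(-1.0, 1.0) for _ in range(p)]

g1, g2 = avg_grad(theta, D), avg_grad(theta, D_neighbor)
l1_dist = sum(abs(a - b) for a, b in zip(g1, g2))
assert l1_dist <= 2 * C0 / m + 1e-12     # Lemma 1: sensitivity at most 2*C0/m
```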
We continue the proof by bounding the sensitivity of $\widetilde{\Theta}_i(t+1)$ to find the noise scale needed to satisfy differential privacy. Using Eq. 4, Eq. 3.2 and Lemma 1, we have:
$$\Delta \widetilde{\Theta}_i(t+1) = \max_{\mathcal{D}_i \sim \mathcal{D}_i'} \big\| \widetilde{\Theta}_i^{\mathcal{D}_i}(t+1) - \widetilde{\Theta}_i^{\mathcal{D}_i'}(t+1) \big\|_1 \quad (7)$$
$$= \frac{\mu D_{ii}}{L_i} \max_{\mathcal{D}_i \sim \mathcal{D}_i'} \big\| \nabla \mathcal{L}_i(\Theta_i(t); \mathcal{D}_i) - \nabla \mathcal{L}_i(\Theta_i(t); \mathcal{D}_i') \big\|_1 \le \frac{2 \mu D_{ii} C_0}{m_i L_i}, \quad (8)$$
where (7)-(8) follow from the fact that $\nabla \mathcal{L}_i$ is the only quantity in the update (3.2) which depends on the local dataset of agent $i$.
Recalling the relation between the sensitivity and the scale of the additive noise in the context of differential privacy [8], we should have:
$$s_i'(t) \ge \frac{\Delta \widetilde{\Theta}_i(t+1)}{\epsilon_i(t)},$$
where $s_i'(t)$ is the scale of the noise added to $\widetilde{\Theta}_i(t+1)$. In the following we show that $s_i'(t) = \frac{\mu D_{ii}}{L_i} s_i(t)$. To compute $s_i'(t)$, we see how the noise $\eta_i(t)$ affects $\widetilde{\Theta}_i(t+1)$. Using Eq. 6 and the definitions of $\widetilde{\Theta}_i(t+1)$ (the update step) and $L_i$ (the block Lipschitz constant), we have:
$$\widetilde{\Theta}_i(t+1) = \Theta_i(t) - \frac{1}{L_i} [\nabla Q(\Theta(t))]_i - \frac{\mu D_{ii}}{L_i} \eta_i(t).$$
So the scale of the noise added to $\widetilde{\Theta}_i(t+1)$ is:
$$s_i'(t) = \frac{\mu D_{ii}}{L_i} s_i(t) = \frac{\mu D_{ii}}{L_i} \cdot \frac{2 C_0}{m_i \epsilon_i(t)} = \frac{\Delta \widetilde{\Theta}_i(t+1)}{\epsilon_i(t)}.$$
Therefore, $s_i'(t) \ge \Delta \widetilde{\Theta}_i(t+1) / \epsilon_i(t)$ is satisfied, hence publishing $\widetilde{\Theta}_i(t+1)$ is $\epsilon_i(t)$-differentially private.
We have shown that at any iteration $t$, publishing $\widetilde{\Theta}_i(t+1)$ by agent $i$ is $\epsilon_i(t)$-differentially private. The mechanism publishes all $\widetilde{\Theta}_i(t+1)$ for $t \in T_i$. Using the composition result for differential privacy established in [14], we have that the mechanism is $(\bar{\epsilon}_i, \bar{\delta}_i)$-DP with $\bar{\epsilon}_i$ as in Theorem 1. ∎
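A minimal sketch of the private publication step, under the noise calibration derived above (the Laplace scale on the local gradient is taken to be $2C_0/(m_i \epsilon_i(t))$; all function and variable names are ours, and plain additive composition is used below in place of the tighter result of [14]):

```python
import math
import random

def laplace_noise(scale, rng):
    # inverse-CDF sampling of a centered Laplace variable with the given scale
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def private_update(local_grad, C0, m, eps_t, rng):
    # perturb each coordinate of the local gradient with Laplace noise of
    # scale 2*C0/(m*eps_t) before it enters the published update
    scale = 2.0 * C0 / (m * eps_t)
    return [g + laplace_noise(scale, rng) for g in local_grad]

rng = random.Random(42)
grad = [0.3, -0.1, 0.7]
noisy = private_update(grad, C0=1.0, m=100, eps_t=0.1, rng=rng)
assert len(noisy) == len(grad)

# basic (additive) composition over the iterations where the agent wakes up
eps_schedule = [0.1] * 20
eps_total = sum(eps_schedule)
assert abs(eps_total - 2.0) < 1e-12
```

Note that a smaller per-iteration budget $\epsilon_i(t)$ or a smaller dataset size $m_i$ both inflate the noise scale, matching the utility/privacy trade-off discussed in the paper.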
Theorem 1 considers the case where $\ell(\cdot; x, y)$ is $C_0$-Lipschitz for all $(x, y)$ with respect to the $L_1$ norm. We could instead assume Lipschitzness with respect to the $L_2$ norm, in which case the noise to add should be Gaussian instead of Laplace. The following remark gives the Gaussian noise needed to preserve differential privacy in this setting.
Remark 5.
Let $\delta_i(t) \in (0, 1)$. In the case where $\ell(\cdot; x, y)$ is $C_0$-Lipschitz with respect to the $L_2$ norm for all $(x, y)$, for any $t \in T_i$, let $\sigma_i(t) \ge \sqrt{2 \ln(1.25 / \delta_i(t))} \cdot \frac{2 C_0}{m_i \epsilon_i(t)}$ for some $\epsilon_i(t) > 0$ and $\delta_i(t) \in (0, 1)$. For the noise vector $\eta_i(t)$ drawn from a Gaussian distribution with scale $\sigma_i(t)$, and for any $T_i$ and initial point $\Theta(0)$ independent of $\mathcal{D}_i$, the mechanism is $(\bar{\epsilon}_i, \bar{\delta}_i)$-DP with $\bar{\epsilon}_i$ and $\bar{\delta}_i$ given by the composition result of [14].
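The Gaussian calibration in Remark 5 follows the classical Gaussian mechanism (see [9]). A small sketch of the scale computation, where the $L_2$ sensitivity $2C_0/m$ from Lemma 1 is our assumption for the $L_2$ setting and the function name is ours:

```python
import math

def gaussian_sigma(C0, m, eps, delta):
    # classical Gaussian mechanism calibration (cf. [9]):
    # sigma >= sqrt(2*ln(1.25/delta)) * Delta_2 / eps,
    # with L2 sensitivity Delta_2 = 2*C0/m (assumed, mirroring Lemma 1)
    return math.sqrt(2.0 * math.log(1.25 / delta)) * (2.0 * C0 / m) / eps

sigma = gaussian_sigma(C0=1.0, m=100, eps=0.5, delta=1e-5)
assert sigma > 0.0
# more budget (larger eps) or more data (larger m) means less noise:
assert gaussian_sigma(1.0, 100, 1.0, 1e-5) < sigma
assert gaussian_sigma(1.0, 200, 0.5, 1e-5) < sigma
```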
A.2 Proof of Theorem 2
We start by introducing a convenient lemma.
Lemma 2.
For any $i \in [n]$, $\Theta$ and $d \in \mathbb{R}^p$ we have:
$$Q(\Theta + U_i d) \le Q(\Theta) + \langle [\nabla Q(\Theta)]_i, d \rangle + \frac{L_i}{2} \| d \|^2,$$
where $U_i$ maps $d$ into the $i$-th block of coordinates.
Proof.
We get this by applying Taylor's inequality to the function $s \mapsto Q(\Theta + s U_i d)$. ∎
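Lemma 2 can be sanity-checked numerically on a separable quadratic, where coordinates play the role of blocks and the block Lipschitz constants are just the diagonal coefficients (our toy instance; for a quadratic the inequality holds with equality):

```python
# Q(theta) = 0.5 * sum_k a_k * theta_k^2, so [grad Q(theta)]_i = a_i * theta_i
# and the block Lipschitz constant of block i is L_i = a_i.
a = [1.0, 4.0, 0.5]

def Q(theta):
    return 0.5 * sum(ak * tk * tk for ak, tk in zip(a, theta))

theta = [2.0, -1.0, 3.0]
i, d = 1, -0.7                      # perturb block (coordinate) i by d
moved = list(theta)
moved[i] += d
lhs = Q(moved)
rhs = Q(theta) + (a[i] * theta[i]) * d + 0.5 * a[i] * d * d
assert lhs <= rhs + 1e-12           # Lemma 2's upper bound holds
```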
Recall that the random variable $\eta_i(t)$ represents the noise added by agent $i$ due to privacy requirements if it wakes up at iteration $t$. To simplify notations we denote the scaled version of the noise by $\tilde{\eta}_i(t) = \frac{\mu D_{ii}}{L_i} \eta_i(t)$.
Recall that under our Poisson clock assumption, each agent is equally likely to wake up at any step $t$. Subtracting $Q^\star = \min_\Theta Q(\Theta)$ and taking the expectation with respect to the agent $i_t$ waking up at step $t$ on both sides, we thus get:
$$\mathbb{E}_{i_t} \big[ Q(\widetilde{\Theta}(t+1)) \big] - Q^\star \le Q(\widetilde{\Theta}(t)) - Q^\star - \frac{1}{2 n L_{\max}} \big\| \nabla Q(\widetilde{\Theta}(t)) \big\|^2 + \frac{1}{n} \sum_{i=1}^{n} \frac{L_i}{2} \mathbb{E} \big\| \tilde{\eta}_i(t) \big\|^2, \quad (9)$$
where $L_{\max} = \max_{1 \le i \le n} L_i$.
For convenience, let us define $e_t = \mathbb{E}[Q(\widetilde{\Theta}(t))] - Q^\star$, where $\mathbb{E}$ denotes the expectation with respect to all variables $i_1, \dots, i_t$ and $\eta(1), \dots, \eta(t)$. Using (9) we thus have:
$$e_{t+1} \le e_t - \frac{1}{2 n L_{\max}} \mathbb{E} \big\| \nabla Q(\widetilde{\Theta}(t)) \big\|^2 + \frac{1}{n} \sum_{i=1}^{n} \frac{L_i}{2} \mathbb{E} \big\| \tilde{\eta}_i(t) \big\|^2. \quad (10)$$
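Once combined with strong convexity, inequalities of the form (10) reduce to a recursion of the generic shape $e_{t+1} \le (1 - \rho)\, e_t + c$. A toy iteration (with arbitrary constants of our choosing, purely illustrative) shows that such a recursion contracts linearly down to an additive noise floor of $c / \rho$:

```python
rho, c = 0.1, 0.02       # hypothetical contraction factor and noise term
e = 1.0                  # initial optimality gap
for _ in range(500):
    e = (1.0 - rho) * e + c
assert abs(e - c / rho) < 1e-6   # converges to the noise floor c/rho = 0.2
```

This is the mechanism behind the utility/privacy trade-off: the privacy noise does not prevent convergence, but it determines the level of the residual error.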
Recall that $Q$ is $\sigma$-strongly convex, i.e. for any $\Theta, \Theta'$ we have:
$$Q(\Theta') \ge Q(\Theta) + \langle \nabla Q(\Theta), \Theta' - \Theta \rangle + \frac{\sigma}{2} \| \Theta' - \Theta \|^2.$$
We minimize the above inequality on both sides with respect to $\Theta'$. We obtain that $\Theta^\star$ minimizes the left-hand side, while