# Weighted Distributed Differential Privacy ERM: Convex and Non-convex

###### Abstract

Distributed machine learning allows different parties to learn a model over all their data sets without disclosing their own data. In this paper, we propose a weighted distributed differential privacy (WD-DP) empirical risk minimization (ERM) method to train a model in a distributed setting, taking the different weights of different clients into account. We guarantee differential privacy by gradient perturbation, adding Gaussian noise, and advance the state of the art of gradient perturbation methods in the distributed setting. Through detailed theoretical analysis, we show that in the distributed setting, both the noise bound and the excess empirical risk bound can be improved by considering the different weights held by multiple parties. Moreover, since the convexity constraint on the loss function in ERM is not easy to satisfy in some situations, we generalize our method to non-convex loss functions that satisfy the Polyak-Lojasiewicz condition. Experiments on real data sets show that our method is more reliable and improves the performance of distributed differential privacy ERM, especially when the data scale on different clients is uneven.

## Introduction

In recent years, machine learning has been widely used in many fields such as data mining and pattern recognition [He et al.2015, Xu, Ni, and Yang2018, Wang et al.2018, Zhang et al.2019]. Because training machine learning algorithms requires data, tremendous amounts of data are collected by individuals and companies. As a result, disclosure of sensitive information is becoming a serious problem. In addition to the data itself, model parameters trained on the data can reveal sensitive information in an indirect way as well.

To solve the problems mentioned above, differential privacy [Dwork2011] is proposed to preserve privacy in machine learning and has been applied to principal component analysis (PCA) [Chaudhuri, Sarwate, and Sinha2013, Ge et al.2018, Wang and Xu2019b], regression [Chaudhuri and Monteleoni2009, Zhang et al.2012, Jayaraman et al.2018], boosting [Dwork, Rothblum, and Vadhan2010, Zhao et al.2018], deep learning [Shokri and Shmatikov2015, Abadi et al.2016, Farquhar and Gal2019] and other fields.

There are mainly three methods to achieve differential privacy: output perturbation [Dwork et al.2006, Pathak, Rane, and Raj2010, Bassily, Smith, and Thakurta2014, Zhang et al.2017], objective perturbation [Chaudhuri and Monteleoni2009, Chaudhuri, Monteleoni, and Sarwate2011] and gradient perturbation [Bassily, Smith, and Thakurta2014, Abadi et al.2016, Geyer, Klein, and Nabi2017]. Among them, gradient perturbation is the most popular method because it can be applied to any gradient descent method, which makes it general, and it not only protects the results but also the gradients, which makes it more reliable.
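To make the gradient perturbation idea concrete, the following sketch runs one perturbed gradient step for logistic regression: per-example gradients are clipped, averaged, and masked with Gaussian noise before the descent step. The clipping threshold and noise scale here are illustrative placeholders, not the calibrated values any particular method prescribes.

```python
import numpy as np

def noisy_gradient_step(w, X, y, lr=0.1, clip=1.0, sigma=0.1, rng=None):
    """One gradient-perturbation step for logistic regression:
    per-example gradients are clipped to norm `clip`, averaged,
    and zero-mean Gaussian noise is added before the descent step."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(y)
    p = 1.0 / (1.0 + np.exp(-X @ w))               # sigmoid predictions
    grads = (p - y)[:, None] * X                   # per-example gradients
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip)  # clip each gradient norm
    g = grads.mean(axis=0)
    g = g + rng.normal(0.0, sigma * clip / n, size=w.shape)  # Gaussian noise
    return w - lr * g
```

Because the noise is added to the released gradient itself, every intermediate quantity an observer sees is already perturbed, which is why gradient perturbation protects both the gradients and the final model.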

With the growth of cooperation among organizations, the desire of multiple parties to train models on combined data is becoming stronger, for example in biomedicine and financial fraud detection. In these situations, different parties want to use all the data without disclosing their own data, which puts more pressure on privacy preservation. Moreover, the number of data instances owned by different parties often varies greatly, which degrades performance significantly.

Table 1: Comparison between our method and existing methods on noise bound, excess empirical risk bound, distributed setting, and non-convex support.

| Method | Gaussian Noise Bound | Excess Empirical Risk Bound | Distributed | Non-convex |
|---|---|---|---|---|
| \citeauthor7 \shortcite7 | None | | Yes | No |
| \citeauthor8 \shortcite8 | | | Yes | No |
| \citeauthor13 \shortcite13 | | | No | Yes |
| \citeauthor11 \shortcite11 | | | No | Yes |
| Our Method WD-DP | | | Yes | Yes |

Distributed machine learning is an approach to the multiple-party learning problem. Among the many distributed learning strategies, divide and conquer is simple and effective. It preserves privacy by minimizing information communication, which has attracted widespread attention from researchers. \citeauthor7 \shortcite7 proposed the first distributed differential privacy machine learning method, whose privacy is preserved by output perturbation. \citeauthor8 \shortcite8 introduced differentially private distributed methods using output perturbation and gradient perturbation, achieving better performance. \citeauthor36 \shortcite36 proposed the federated learning method to address the distributed machine learning problem, but without privacy preservation. Based on [McMahan et al.2016], \citeauthor15 \shortcite15 proposed a method to guarantee that whether a client participates in federated learning cannot be inferred, preserving privacy in federated learning to some extent. \citeauthor37 \shortcite37 proposed a user-level differential privacy method for training LSTMs, applied to language models. However, among the work mentioned above, only federated-learning-based methods consider the weights of parties when aggregating parameters. In real scenarios, the data scale on different clients is often uneven, so simply averaging without weights leads to worse performance. Moreover, most work assumes that the loss function is convex in the theoretical analysis and does not consider the non-convex condition.

To address the problems above, in this paper we propose the Weighted Distributed Differential Privacy (WD-DP) ERM method based on the divide-and-conquer distributed approach, applying gradient perturbation with Gaussian noise to guarantee $(\epsilon, \delta)$-differential privacy. When aggregating the models' parameters, we take into account the different weights owned by different parties instead of simply averaging, which reduces the negative impact caused by uneven data scale and leads to better noise and excess empirical risk bounds theoretically. Experiments on real data sets show that the performance of our method is much better than that of the method proposed in [Jayaraman et al.2018], the best method in distributed differential privacy ERM we know of. Moreover, since most previous theoretical analysis of differential privacy ERM is based on convex loss functions and this constraint is not easy to guarantee in some situations, we first improve the proof of the excess empirical risk bound given by \citeauthor11 \shortcite11 in the centralized setting and then generalize our method to non-convex functions that satisfy the Polyak-Lojasiewicz condition in the distributed setting.

The rest of the paper is organized as follows. We introduce related work on distributed differential privacy ERM methods and on centralized differential privacy ERM under the non-convex condition in Section 2. We describe our method WD-DP in detail and analyze its $(\epsilon, \delta)$-differential privacy in Section 3. We give the theoretical analysis of the excess empirical risk bound of our method under both convex and non-convex conditions in Section 4. We present the experimental results in Section 5. Finally, we conclude the paper in Section 6.

## Related Work

In this section, we first introduce related work on distributed differential privacy machine learning. Then, we introduce work on centralized differential privacy ERM under the non-convex condition.

### Distributed Setting

\citeauthor7 \shortcite7 proposed a distributed privacy-preserving protocol whose objective function includes a regularization term. Different parties train models locally and interact with a curator to construct additive shares of a perturbed aggregated model. This work guarantees differential privacy by output perturbation, adding Laplace noise. In this work, the delivery of parameters relies on homomorphic encryption [Paillier1999], which is computationally expensive.

\citeauthor8 \shortcite8 introduced a distributed learning method, combining differential privacy with secure multi-party computation (SMC) [Tian et al.2016]. This work guarantees differential privacy by output perturbation and gradient perturbation, adding noise within an SMC protocol. The noise bound and excess empirical risk bound are better than those in [Pathak, Rane, and Raj2010], under the assumption that the loss function is $G$-Lipschitz and $L$-smooth. Notably, in this method, parties aggregate parameters by simply averaging. As a result, if the numbers of data instances on the parties are not even, the performance decreases rapidly. Unfortunately, in real scenarios, the data scale on clients is often uneven.

\citeauthor36 \shortcite36 proposed a decentralized method, Federated Learning, to solve the distributed machine learning problem, and applied it to deep networks. This method leaves training data distributed on different parties and learns a shared model by aggregating local models. This work considers the different weights of different parties when aggregating models, but does not consider privacy preservation much, providing no theoretical analysis of privacy or utility.

In the method WD-DP proposed in this paper, by considering the different weights held by different parties when aggregating parameters, we achieve better performance both theoretically and practically, whether the data scale is even or not. So, our method is more general and adapts to most scenarios. The comparison between our method and the other methods mentioned above on noise bound and excess empirical risk bound is given in Table 1.

It can be observed in Table 1 that, without considering the regularization term, the noise bound and excess empirical risk bound of our method are better than those of the best distributed method we know of, proposed by \citeauthor8 \shortcite8. In particular, our method is much better than the method proposed by \citeauthor8 \shortcite8 when the data scale is uneven across clients, and retains the same performance when the data scale is even. In other words, the method mentioned above is a special case of our method WD-DP under the average setting. Obviously, the excess empirical risk bound of our method is also much tighter than that in [Pathak, Rane, and Raj2010]. It is worth emphasizing that although our method is proposed for the distributed setting, it achieves almost the same theoretical performance as centralized methods.

### Non-convex ERM

\citeauthor13 \shortcite13 proposed Random Round Private SGD, guaranteeing $(\epsilon, \delta)$-differential privacy for non-convex loss functions. It is the first theoretical result on the centralized non-convex differentially private ERM problem. In this method, the excess empirical risk bound is proportional to the upper bound of the norm of the model's parameters.

\citeauthor11 \shortcite11 gave theoretical analyses of the noise bound and excess empirical risk bound of gradient perturbation under the non-convex condition in the centralized setting, assuming a fixed number of iterations. However, the proof of the excess empirical risk bound of this method can be improved, leading to a tighter excess empirical risk bound.

\citeauthor21 \shortcite21 studied the centralized differential privacy ERM problem with non-convex loss functions and gave upper bounds on the utility. This work considers the problem in both low- and high-dimensional spaces and shows that for some special non-convex loss functions, the utility can be improved to a level similar to that of convex ones.

In this paper, we first improve the proof of the excess empirical risk bound in [Wang, Ye, and Xu2017]. Then, since there is no previous theoretical analysis of distributed non-convex differentially private ERM, we extend this method to the distributed setting, in which the loss function is not constrained to be convex. The comparison between our method and these centralized methods under the non-convex condition on noise bound and excess empirical risk bound is given in Table 1.

It can be observed in Table 1 that by improving the proof given by \citeauthor11 \shortcite11, our excess empirical risk bound is tighter than before by a factor of $\log n$. Meanwhile, since the upper bound on the parameter norm is hard to control, our method is more reliable than that proposed by \citeauthor13 \shortcite13, with a tighter noise bound.

## WD-DP: Weighted Distributed Differential Privacy Empirical Risk Minimization

In this section, we first introduce some basic definitions and empirical risk minimization in the distributed setting. Then, we describe our method WD-DP in detail and give a theoretical analysis of the $(\epsilon, \delta)$-DP guarantee of our algorithm.

Given a $p$-dimensional vector $v = (v_1, \dots, v_p)^{\top}$, denote its $\ell_2$-norm as $\|v\|_2 = \sqrt{\sum_{i=1}^{p} v_i^2}$. $\tilde{O}(\cdot)$ is similar to $O(\cdot)$, but hides logarithmic factors. Denote the probability distribution of the data as $\mathcal{D}$; two databases differing by a single element are denoted $D$ and $D'$ and called adjacent databases.

###### Definition 1.

[Dwork et al.2006] A randomized mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private ($(\epsilon, \delta)$-DP) if for any adjacent databases $D, D'$ and any set of outputs $S \subseteq \mathrm{range}(\mathcal{M})$,

$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta,$$

where $\mathrm{range}(\mathcal{M}) \subseteq \mathbb{R}^p$ and $p$ is the number of parameters.

According to the definition, differential privacy requires that adjacent data sets lead to similar distributions over the output of the randomized algorithm $\mathcal{M}$. This implies that an adversary will draw essentially the same conclusions about an individual whether or not that individual's data was used, even if many records are known a priori to the adversary.

The centralized ERM objective function is defined as:

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i),$$

where $z_i$ denotes the $i$-th data instance and $\ell$ is the loss function.

### Distributed Differential Privacy

Suppose there are $m$ parties $P_1, \dots, P_m$, owning data sets $D_1, \dots, D_m$ of sizes $n_1, \dots, n_m$, respectively. The parties train their own models locally to prevent data disclosure (in this paper, a model is identified with its parameters), and their models are then aggregated by a trusted third party (called the server).

So, in the distributed setting, considering all the parties, the objective function is:

$$L(w) = \frac{1}{n} \sum_{i=1}^{m} \sum_{j=1}^{n_i} \ell(w, z_{i,j}), \qquad (1)$$

where party $P_i$'s data instances are denoted as $z_{i,1}, \dots, z_{i,n_i}$ and $n = \sum_{i=1}^{m} n_i$.

By equation (1), when it comes to gradient perturbation, considering round $t$ with learning rate $\eta$, the updating criterion on party $P_i$ is:

$$w_{t+1}^{(i)} = w_t^{(i)} - \eta \left( \nabla L_i(w_t^{(i)}) + b_i \right),$$

and the updating criterion on the server after $T$ local iterations is:

$$\tilde{w} = \frac{1}{m} \sum_{i=1}^{m} w_T^{(i)}, \qquad (2)$$

where $L_i$ represents the objective function over party $P_i$, $b_i$ is Gaussian noise guaranteeing differential privacy, and $\tilde{w}$ denotes the aggregated model on the server.

### Weighted Distributed Differential Privacy

Traditional methods use equation (2) to aggregate parameters by simply averaging. However, this approach pays more attention to data instances in small data sets, which leads to worse noise and excess empirical risk bounds. Since the data scale on clients in real scenarios is often uneven, simply averaging leads to worse performance.

So, to solve the problem mentioned above, instead of simply averaging the parameters when aggregating models, we weight the different parties according to their data sets' sizes, which changes the updating criterion on the server to:

$$\tilde{w} = \sum_{i=1}^{m} \frac{n_i}{n} w_T^{(i)}.$$

When the weights of the different parties are considered, data instances on different parties receive the same attention, which reduces the negative impact caused by a single bad data instance, by rare but exceptionally high noise generated for guaranteeing differential privacy, or by uneven data scale.
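The two aggregation rules can be sketched side by side as follows; `local_models` and `sizes` are illustrative names, and the weighted branch assigns party $i$ the weight $n_i / n$ as described above.

```python
import numpy as np

def aggregate(local_models, sizes, weighted=True):
    """Server-side aggregation of local models.
    With weighted=True, party i's model gets weight n_i / n
    (n = total number of instances); otherwise a plain average."""
    local_models = np.asarray(local_models, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    if weighted:
        weights = sizes / sizes.sum()
    else:
        weights = np.full(len(sizes), 1.0 / len(sizes))
    return weights @ local_models  # convex combination of the local models
```

With sizes (90, 10), the plain average treats each instance of the small party as nine times more influential than an instance of the large party, which is exactly the imbalance the weighted rule removes.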

Our method is detailed in Algorithm 1. Note that in Algorithm 1, we assume that the sizes of the data sets are public knowledge, as in [McMahan et al.2016].

###### Differential Privacy.

In this paper, we guarantee $(\epsilon, \delta)$-DP using the Gaussian mechanism proposed by \citeauthor9 \shortcite9 and the moments accountant introduced by \citeauthor6 \shortcite6.
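For intuition about how Gaussian noise is calibrated to $(\epsilon, \delta)$, the classical analytic bound from Dwork and Roth can be sketched as follows. This is a simpler and looser calibration than the moments accountant actually used here, and the function names are illustrative.

```python
import math
import random

def gaussian_mechanism_sigma(sensitivity, epsilon, delta):
    """Classical (epsilon, delta)-DP calibration of the Gaussian mechanism
    (Dwork & Roth): sigma >= sqrt(2 ln(1.25/delta)) * sensitivity / epsilon.
    Valid for epsilon in (0, 1)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon

def gaussian_query(value, sensitivity, epsilon, delta, rng=None):
    """Release a scalar query answer perturbed with calibrated Gaussian noise."""
    rng = rng or random.Random(0)
    sigma = gaussian_mechanism_sigma(sensitivity, epsilon, delta)
    return value + rng.gauss(0.0, sigma)
```

The noise scale grows as the privacy budget shrinks, matching the accuracy-versus-privacy trade-off observed in the experiments; the moments accountant tightens this scale further over many iterations.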

###### Theorem 1.

In Algorithm 1, if the loss function $\ell$ is $G$-Lipschitz over its domain and the Gaussian noise scale $\sigma$ satisfies

(3)

then Algorithm 1 is $(\epsilon, \delta)$-DP for some constant $c$.

###### Proof.

Consider the query which may disclose privacy:

(4)

where $w_T^{(i)}$ represents $w^{(i)}$ after $T$ local rounds.

In the moments accountant method proposed by \citeauthor6 \shortcite6, the $\lambda$-th moment of mechanism $\mathcal{M}$ is defined as:

$$\alpha_{\mathcal{M}}(\lambda; D, D') = \log \mathbb{E}_{o \sim \mathcal{M}(D)}\left[\exp\left(\lambda \, c(o; \mathcal{M}, D, D')\right)\right], \qquad (5)$$

where $c(o; \mathcal{M}, D, D')$ is the privacy loss at output $o$, defined as:

$$c(o; \mathcal{M}, D, D') = \log \frac{\Pr[\mathcal{M}(D) = o]}{\Pr[\mathcal{M}(D') = o]}. \qquad (6)$$

In order to preserve privacy, it is necessary to bound all possible moments. So, $\alpha_{\mathcal{M}}(\lambda)$ is defined as:

$$\alpha_{\mathcal{M}}(\lambda) = \max_{D, D'} \alpha_{\mathcal{M}}(\lambda; D, D').$$

Denote the probability distributions of mechanism $\mathcal{M}$ on the adjacent databases $D$ and $D'$ as $P$ and $Q$, respectively. By Definition 2.1 in [Bun and Steinke2016], the Rényi divergence of order $\alpha$ is defined as:

$$D_{\alpha}(P \| Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right]. \qquad (7)$$

By equations (5), (6), (7) and the definitions of $P$ and $Q$, we have the following relations for mechanism $\mathcal{M}$:

By Lemma 2.5 in [Bun and Steinke2016], we have:

Note that $\ell$ is $G$-Lipschitz, and that the adjacent databases differ in only a single element; supposing it is the $k$-th one, we have:

Thus,

By Theorem 2.1 in [Abadi et al.2016], we have:

Then, note that , we have:

Taking for some constant , we can guarantee that:

and as a result, we have:

which means:

This implies $(\epsilon, \delta)$-DP by Theorem 2.2 in [Abadi et al.2016]. ∎

In Theorem 1, $\ell$ is required to be $G$-Lipschitz, but not constrained to be convex. Thus, Theorem 1 holds under both convex and non-convex conditions.

Although Algorithm 1 considers the distributed setting, equation (4) does not depend on the number of parties $m$. As a result, the Gaussian noise guaranteeing $(\epsilon, \delta)$-DP does not depend on the number of parties either, and has the same form as in the centralized setting.

Moreover, because we use the moments accountant method and consider the different weights held by the parties when aggregating models, the bound is tighter than that introduced by \citeauthor8 \shortcite8 by a factor involving the smallest size of the data sets owned by the parties. When the data scale on clients is uneven, our method is much better.

## Theoretical Analysis over Convex and Non-convex Conditions

In this section, we first analyze the excess empirical risk of our method WD-DP under the convex condition and then generalize the analysis to non-convex functions that satisfy the Polyak-Lojasiewicz condition. To our knowledge, this is the first theoretical analysis of the excess empirical risk bound for non-convex distributed differential privacy ERM.

### Convex

In this part, we give the theoretical analysis of the excess empirical risk when the objective function is $\mu$-strongly convex.

###### Theorem 2.

Suppose that $\ell$ is $G$-Lipschitz and $L$-smooth, the objective function is $\mu$-strongly convex and differentiable, $\sigma$ is the same as in (3), and the learning rate is $\eta$. We have:

where $p$ is the number of parameters.
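To make the flavor of this bound concrete, the following sketch runs noisy gradient descent on a synthetic $\mu$-strongly convex, $L$-smooth quadratic and returns the excess empirical risk $f(w_T) - f(w^*)$. It illustrates the behavior the theorem quantifies (geometric decay of the optimization error plus a noise-dependent floor), not the theorem's exact constants.

```python
import numpy as np

def noisy_gd_excess_risk(mu=1.0, L=4.0, T=200, sigma=0.01, d=5, seed=0):
    """Noisy gradient descent on f(w) = 0.5 * sum_j a_j * w_j^2 with
    eigenvalues a_j in [mu, L], so f is mu-strongly convex and L-smooth.
    The minimizer is w* = 0 with f(w*) = 0; returns f(w_T) - f(w*)."""
    rng = np.random.default_rng(seed)
    eigs = np.linspace(mu, L, d)   # spectrum of the quadratic
    w = np.ones(d)                 # initial point
    lr = 1.0 / L                   # standard smoothness step size
    for _ in range(T):
        grad = eigs * w + rng.normal(0.0, sigma, size=d)  # perturbed gradient
        w = w - lr * grad
    return 0.5 * np.sum(eigs * w * w)
```

With `sigma=0` the excess risk decays geometrically to machine precision; with noise, it settles around a small positive level proportional to the noise variance, mirroring the role of $\sigma$ in the bound.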

###### Proof.

According to updating criteria of gradient descent:

Since the objective function is $L$-smooth and differentiable, we have:

(8)

Since the objective function is $\mu$-strongly convex and differentiable, from [Csiba and Richtárik2017] we have:

(9)

For a random variable $x$, we have:

$$\mathbb{E}\|x\|^2 = \|\mathbb{E}x\|^2 + \mathrm{Var}(x), \qquad (10)$$

where $\mathrm{Var}(x)$ denotes the variance of $x$.

So, by (9) and (10), inequality (8) can be rewritten as:

Then, summing over iterations, we have:

(11)

Taking , we have:

∎

###### Remark 1.

In [Wang, Ye, and Xu2017], inequality (11) is loosely relaxed to:

(12)

Obviously, inequality (11) is tighter than inequality (12). In this way, we improve the proof process, leading to a better excess empirical risk bound by a factor of $\log n$.

It can be observed that our bound is better than that in the distributed setting [Jayaraman et al.2018] and in the centralized setting [Wang, Ye, and Xu2017]. Intuitively, weighting the parties means that data instances owned by all parties are of the same importance, similar to the centralized setting. Conversely, simply averaging gives more weight to data instances in smaller data sets.

### Non-convex

In this part, we generalize the analysis above to non-convex functions that satisfy the Polyak-Lojasiewicz condition.

###### Definition 2.

For a function $f$, denote $f^* = \min_{w} f(w)$. If there exists $\mu > 0$ such that for every $w$,

$$\frac{1}{2} \|\nabla f(w)\|^2 \ge \mu \left( f(w) - f^* \right), \qquad (13)$$

then the function $f$ satisfies the Polyak-Lojasiewicz condition.

Strongly convex functions also satisfy inequality (13); in fact, the Polyak-Lojasiewicz condition is much more general than strong convexity. \citeauthor10 \shortcite10 showed that when a function is differentiable and $L$-smooth under the $\ell_2$ norm, the following chain of implications holds:

Strong Convexity (SC) $\Rightarrow$ Essential Strong Convexity (ESC) $\Rightarrow$ Weak Strong Convexity (WSC) $\Rightarrow$ Restricted Secant Inequality (RSI) $\Rightarrow$ Polyak-Lojasiewicz Inequality (PL) $\Rightarrow$ Error Bound (EB)
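As a concrete instance, the least-squares loss $f(w) = \frac{1}{2}\|Xw - y\|^2$ satisfies (13) with $\mu$ equal to the square of the smallest nonzero singular value of $X$, even when it is not strongly convex. A quick numerical check (illustrative, not from the paper):

```python
import numpy as np

def pl_ratio(X, y, w):
    """For the least-squares loss f(w) = 0.5 * ||Xw - y||^2, return the
    ratio  0.5 * ||grad f(w)||^2 / (f(w) - f*).  The PL condition says
    this ratio is at least mu = sigma_min(X)^2 for every w != w*."""
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizer
    f = 0.5 * np.sum((X @ w - y) ** 2)
    f_star = 0.5 * np.sum((X @ w_star - y) ** 2)
    grad = X.T @ (X @ w - y)
    return 0.5 * np.sum(grad ** 2) / (f - f_star)
```

Writing $w - w^*$ in the right singular basis of $X$ shows the ratio is a weighted mean of the squared singular values, which is why it never falls below $\sigma_{\min}^2$.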

###### Theorem 3.

Suppose that $\ell$ is $G$-Lipschitz and $L$-smooth, the objective function satisfies the Polyak-Lojasiewicz condition and is differentiable, and $\sigma$ is the same as in (3). We have:

where $p$ is the number of parameters.

The proof of Theorem 3 is shown in Appendix A.1.

It can be observed that our excess empirical risk bounds, for both convex functions and non-convex functions satisfying the Polyak-Lojasiewicz condition, are tighter than the bound for convex functions in [Jayaraman et al.2018] by a factor involving the number of parties and the smallest data set's size. In the situation of uneven data scale in real scenarios, the data set sizes differ greatly, and our method is markedly superior.

Moreover, by Remark 1, we prove that the excess empirical risk bound can be tighter than that in [Wang, Ye, and Xu2017] by a factor of $\log n$.

## Experiments

Experiments are performed on classification tasks. We compare our method to the gradient perturbation method proposed by \citeauthor8 \shortcite8 and the centralized privacy method proposed by \citeauthor11 \shortcite11. The comparison is reported in terms of accuracy and optimal gap, where the optimal gap is defined with respect to the centralized optimal non-private model. Accuracy represents the performance on test data, and the optimal gap represents the excess empirical risk on training data.

We use the logistic regression method on the data sets KDDCup99 [Hettich and Bay1999], Adult [Dua and Graff2017], Bank [Moro, Cortez, and Rita2014], Breast Cancer [Mangasarian and Wolberg1990] and Credit Card Fraud [Bontempi and Worldline2018]; the numbers of total data instances are 70000, 45222, 41188, 699 and 984, respectively. In all the experiments, the Lipschitz constant of the loss function is 1 (the proof is shown in Appendix A.2). The total number of local iteration rounds is set to 1000, and the learning rate is chosen by cross-validation from {0.01, 0.05, 0.1, 0.5, 1}.

First, we evaluate the influence of the differential privacy budget $\epsilon$, which is set from 0.01 to 0.25. In this setting, we fix the number of clients, and the number of data instances owned by each client is set randomly. Moreover, considering real scenarios, we set a threshold on the smallest data set's size, ensuring that effective models can be trained by all clients. Then, we evaluate the influence of differences in the data set sizes owned by different clients. We define the level of non-average as the ratio between the maximum and minimum data set sizes. In the experiments, considering the total number of data instances, this level is set from 1 to 9 on the data sets KDDCup99, Adult and Bank, and from 1 to 5 on the remaining data sets. In particular, a level of 1 corresponds to the average setting. For simplicity, we divide the clients into 2 groups; clients in the same group have the same data set size.

Figure 1 shows the accuracy over the privacy budget $\epsilon$. It can be observed that by considering the different weights of different clients, our method WD-DP is better than the method proposed in [Jayaraman et al.2018] and is similar to the centralized method proposed in [Wang, Ye, and Xu2017]. Performance improves as $\epsilon$ increases, which matches intuition. The corresponding optimal gap and more experimental results with different numbers of clients are shown in Appendix B, with similar results.

Figure 2 shows the accuracy over the level of non-average. It can be observed that in the average setting, the accuracy of our proposed method WD-DP is similar to that of the method proposed in [Jayaraman et al.2018]. However, as the level of non-average increases, meaning the data scale becomes more and more uneven, the accuracy of our method stays steady, while the accuracy of the method proposed by \citeauthor8 \shortcite8 decreases rapidly or fluctuates sharply. Thus, our method is more reliable, especially in the case of uneven data scale, which agrees with the theoretical analysis. The corresponding optimal gap and more experimental results with different numbers of clients and privacy budgets are also shown in Appendix B, with similar results.

## Conclusion and Discussion

We propose a distributed differential privacy ERM method, WD-DP, providing $(\epsilon, \delta)$-differential privacy by gradient perturbation with Gaussian noise. Our work shows that by considering the different weights of different clients, the noise bound and the excess empirical risk bound can be improved in the distributed setting. Moreover, since most previous work on differential privacy ERM assumes convex loss functions and this constraint is not easy to satisfy in some situations, we generalize our method to non-convex conditions. Theoretical analysis and experimental results on real data sets show that we improve the best previous noise bound and excess empirical risk bound for distributed differential privacy ERM, especially when the data scale on clients is uneven, which is common in real scenarios. In future work, we will focus on non-convex optimization in the distributed differential privacy ERM setting (e.g., deep learning) and on reducing the time complexity of the model, since most models are synchronous.

## References

- [Abadi et al.2016] Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H. B.; Mironov, I.; Talwar, K.; and Zhang, L. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–318.
- [Bassily, Smith, and Thakurta2014] Bassily, R.; Smith, A.; and Thakurta, A. 2014. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, 464–473.
- [Bontempi and Worldline2018] Bontempi, G., and Worldline. 2018. ULB the machine learning group.
- [Bun and Steinke2016] Bun, M., and Steinke, T. 2016. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, 635–658. Springer.
- [Chaudhuri and Monteleoni2009] Chaudhuri, K., and Monteleoni, C. 2009. Privacy-preserving logistic regression. In Advances in neural information processing systems, 289–296.
- [Chaudhuri, Monteleoni, and Sarwate2011] Chaudhuri, K.; Monteleoni, C.; and Sarwate, A. D. 2011. Differentially private empirical risk minimization. Journal of Machine Learning Research 12(Mar):1069–1109.
- [Chaudhuri, Sarwate, and Sinha2013] Chaudhuri, K.; Sarwate, A. D.; and Sinha, K. 2013. A near-optimal algorithm for differentially-private principal components. The Journal of Machine Learning Research 14(1):2905–2943.
- [Csiba and Richtárik2017] Csiba, D., and Richtárik, P. 2017. Global convergence of arbitrary-block gradient methods for generalized polyak-lojasiewicz functions. arXiv preprint arXiv:1709.03014.
- [Dua and Graff2017] Dua, D., and Graff, C. 2017. UCI machine learning repository.
- [Dwork et al.2006] Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, 265–284. Springer.
- [Dwork, Rothblum, and Vadhan2010] Dwork, C.; Rothblum, G. N.; and Vadhan, S. 2010. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, 51–60.
- [Dwork2011] Dwork, C. 2011. Differential privacy. Encyclopedia of Cryptography and Security 338–340.
- [Farquhar and Gal2019] Farquhar, S., and Gal, Y. 2019. Differentially private continual learning. arXiv preprint arXiv:1902.06497.
- [Ge et al.2018] Ge, J.; Wang, Z.; Wang, M.; and Liu, H. 2018. Minimax-optimal privacy-preserving sparse pca in distributed systems. In International Conference on Artificial Intelligence and Statistics, 1589–1598.
- [Geyer, Klein, and Nabi2017] Geyer, R. C.; Klein, T.; and Nabi, M. 2017. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557.
- [He et al.2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, 1026–1034.
- [Hettich and Bay1999] Hettich, S., and Bay, S. D. 1999. The UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science.
- [Jayaraman et al.2018] Jayaraman, B.; Wang, L.; Evans, D.; and Gu, Q. 2018. Distributed learning without distress: Privacy-preserving empirical risk minimization. In Advances in Neural Information Processing Systems, 6343–6354.
- [Karimi, Nutini, and Schmidt2016] Karimi, H.; Nutini, J.; and Schmidt, M. 2016. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 795–811. Springer.
- [Mangasarian and Wolberg1990] Mangasarian, O. L., and Wolberg, W. H. 1990. Cancer diagnosis via linear programming. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
- [McMahan et al.2016] McMahan, H. B.; Moore, E.; Ramage, D.; Hampson, S.; et al. 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
- [McMahan et al.2017] McMahan, H. B.; Ramage, D.; Talwar, K.; and Zhang, L. 2017. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963.
- [Moro, Cortez, and Rita2014] Moro, S.; Cortez, P.; and Rita, P. 2014. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems 62:22–31.
- [Paillier1999] Paillier, P. 1999. Public-key cryptosystems based on composite degree residuosity classes. In International Conference on the Theory and Applications of Cryptographic Techniques, 223–238. Springer.
- [Pathak, Rane, and Raj2010] Pathak, M.; Rane, S.; and Raj, B. 2010. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems, 1876–1884.
- [Shokri and Shmatikov2015] Shokri, R., and Shmatikov, V. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, 1310–1321.
- [Tian et al.2016] Tian, L.; Jayaraman, B.; Gu, Q.; and Evans, D. 2016. Aggregating private sparse learning models using multi-party computation. In NIPS Workshop on Private Multi-Party Machine Learning.
- [Wang and Xu2019a] Wang, D., and Xu, J. 2019a. Differentially private empirical risk minimization with smooth non-convex loss functions: A non-stationary view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 1182–1189.
- [Wang and Xu2019b] Wang, D., and Xu, J. 2019b. Principal component analysis in the local differential privacy model. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 4795–4801. International Joint Conferences on Artificial Intelligence Organization.
- [Wang et al.2018] Wang, Y.; Zhou, W.; Zhang, Q.; and Li, H. 2018. Convolutional neural networks with generalized attentional pooling for action recognition. In 2018 IEEE Visual Communications and Image Processing (VCIP), 1–4.
- [Wang, Ye, and Xu2017] Wang, D.; Ye, M.; and Xu, J. 2017. Differentially private empirical risk minimization revisited: Faster and more general. In Advances in Neural Information Processing Systems, 2722–2731.
- [Xu, Ni, and Yang2018] Xu, J.; Ni, B.; and Yang, X. 2018. Video prediction via selective sampling. In Advances in Neural Information Processing Systems, 1705–1715.
- [Zhang et al.2012] Zhang, J.; Zhang, Z.; Xiao, X.; Yang, Y.; and Winslett, M. 2012. Functional mechanism: regression analysis under differential privacy. Proceedings of the VLDB Endowment 5(11):1364–1375.
- [Zhang et al.2017] Zhang, J.; Zheng, K.; Mou, W.; and Wang, L. 2017. Efficient private erm for smooth objectives. arXiv preprint arXiv:1703.09947.
- [Zhang et al.2019] Zhang, H.; Wheldon, C.; Dunn, A. G.; Tao, C.; Huo, J.; Zhang, R.; Prosperi, M.; Guo, Y.; and Bian, J. 2019. Mining twitter to assess the determinants of health behavior towards human papillomavirus vaccination in the united states. arXiv preprint arXiv:1907.11624.
- [Zhao et al.2018] Zhao, L.; Ni, L.; Hu, S.; Chen, Y.; Zhou, P.; Xiao, F.; and Wu, L. 2018. Inprivate digging: Enabling tree-based distributed data mining with differential privacy. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, 2087–2095.

## Appendix A A. More Details and Proofs

### A.1 Proof of Theorem 3

###### Proof.

The proof is similar to Theorem 2.

According to updating criteria of gradient descent:

Since the objective function is $L$-smooth and differentiable, we have:

(14)

Note that the objective function satisfies the Polyak-Lojasiewicz condition; then we have:

(15)

For a random variable $x$, we have:

$$\mathbb{E}\|x\|^2 = \|\mathbb{E}x\|^2 + \mathrm{Var}(x), \qquad (16)$$

where $\mathrm{Var}(x)$ denotes the variance of $x$.

So, by (15) and (16), inequality (14) can be rewritten as:

Then, summing over iterations, we have:

Taking , we have:

∎

### A.2 The Lipschitz Constant of Logistic Regression when Using Cross-Entropy

When using cross-entropy, the loss function of logistic regression is:

$$\ell(w, (x, y)) = -y \log h_w(x) - (1 - y) \log\left(1 - h_w(x)\right), \qquad (17)$$

where $h_w(x) = \frac{1}{1 + e^{-w^{\top} x}}$ and $y \in \{0, 1\}$.

###### Proof.

From (17), we have:

$$\nabla_w \ell = \left(h_w(x) - y\right) x.$$

Then, note that $|h_w(x) - y| < 1$; since $x$ is normalized so that $\|x\|_2 \le 1$, we have $\|\nabla_w \ell\|_2 \le 1$, which means the Lipschitz constant of $\ell$ is 1. ∎
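This bound can also be checked numerically. Assuming $\|x\|_2 \le 1$ and $y \in \{0, 1\}$, the gradient norm computed below never exceeds 1:

```python
import numpy as np

def logistic_grad_norm(w, x, y):
    """Euclidean norm of the cross-entropy gradient (h_w(x) - y) * x
    for logistic regression, where h_w(x) = sigmoid(w . x)."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return np.linalg.norm((p - y) * x)
```

Sampling many normalized inputs and arbitrary parameter vectors, the largest observed gradient norm stays below 1, consistent with the Lipschitz constant derived above.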

## Appendix B B. More Experimental Results

We give the accuracy and optimal gap (defined with respect to the centralized optimal non-private model) over the privacy budget $\epsilon$ in Figures 3 to 11. Figures 3, 5, 7, 9 and 11 show that by considering the different weights held by clients, the optimal gap of our method is better than that of the distributed method proposed in [Jayaraman et al.2018] and similar to that of the centralized method proposed in [Wang, Ye, and Xu2017], which means the excess empirical risk of our method is similar to that of centralized methods. As $\epsilon$ increases, the optimal gap decreases, which matches intuition. Figures 4, 6, 8 and 10 show that the accuracy of our method is better than that of the method proposed by \citeauthor8 \shortcite8, thanks to weighting the parties. The experimental results show that the performance of our method is on par with centralized methods, in line with the theoretical analysis in Section 4.

Figures 12 to 16 show the accuracy and optimal gap over the level of non-average. In this setting, the remaining parameters are chosen randomly. Figures 12, 14 and 16 show that as the level of non-average increases, meaning the data scale on different clients becomes more and more uneven, the optimal gap of our method stays steady. However, the optimal gap of the method proposed by \citeauthor8 \shortcite8 increases rapidly or fluctuates sharply. Figures 13 and 15 show that the accuracy of our method remains steady as the level of non-average increases, while the accuracy of the method proposed in [Jayaraman et al.2018] decreases rapidly or fluctuates sharply. Thus, our method is more reliable than the method proposed by \citeauthor8 \shortcite8, especially when the data scale is uneven, in agreement with the theoretical analysis in Section 4.