DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM
Abstract
Machine learning (ML) models trained by differentially private stochastic gradient descent (DP-SGD) have much lower utility than non-private ones. To mitigate this degradation, we propose a DP Laplacian smoothing SGD (DP-LSSGD) for privacy-preserving ML. At the core of DP-LSSGD is the Laplacian smoothing operator, which smooths out the Gaussian noise vector used in the Gaussian mechanism. Under the same amount of noise used in the Gaussian mechanism, DP-LSSGD attains the same differential privacy guarantee, but a strictly better utility guarantee for convex optimization than DP-SGD, improving the bound by a factor that is much less than one (excluding an intrinsic term that is usually dominated by the other terms). In practice, DP-LSSGD makes training both convex and nonconvex ML models more efficient and enables the trained models to generalize better. For ResNet20, under the same strong differential privacy guarantee, DP-LSSGD lifts the testing accuracy of the trained private model significantly compared with DP-SGD. The proposed algorithm is simple to implement, and the extra computational complexity and memory overhead compared with DP-SGD are negligible. DP-LSSGD is applicable to training a large variety of ML models, including deep neural nets. The code is available at https://github.com/BaoWangMath/DPLSSGD.
1 Introduction
Many released machine learning (ML) models are trained on sensitive data that are often crowdsourced or contain personal private information [42, 14, 25]. With a large number of parameters, deep neural nets (DNNs) can memorize the sensitive training data, and it is possible to recover that data and break privacy by attacking the released models [33]. For example, Fredrikson et al. demonstrated that a model-inversion attack can recover training images from a facial recognition system [15]. Protecting the privacy of sensitive training data is one of the most critical tasks in ML.
Differential privacy (DP) [11, 10] is a theoretically rigorous tool for designing algorithms on aggregated databases with a privacy guarantee. The basic idea is to add a certain amount of noise to randomize the output of a given algorithm such that attackers cannot distinguish the outputs of any two adjacent input datasets that differ in only one entry. Two types of noise are typically injected into the algorithm for a DP guarantee: Laplace noise and Gaussian noise [11].
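As a minimal illustration of the Gaussian mechanism (a sketch with names of our own choosing, not code from this paper), the noise scale is calibrated to the query's sensitivity:

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, sigma, rng):
    """Release `value` with Gaussian noise scaled to the query's sensitivity."""
    return value + rng.normal(0.0, sigma * sensitivity)

rng = np.random.default_rng(0)
# A counting query ("how many records satisfy a predicate") has sensitivity 1:
# changing one database entry changes the count by at most one.
true_count = 42
noisy_count = gaussian_mechanism(true_count, sensitivity=1.0, sigma=2.0, rng=rng)
```

The larger the noise multiplier sigma, the stronger the privacy and the worse the utility; quantifying this tradeoff is the subject of the rest of the paper.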
For repeated applications of additive-noise-based mechanisms, many tools have been invented to analyze the DP guarantee of the model obtained at the final stage. These include the basic composition theorem [9, 8], the strong composition theorem and its refinements [13, 23], the moments accountant [1], etc. Beyond the original definition of DP, there are also many other ways to define privacy, e.g., local DP [7], concentrated/zero-concentrated DP [12, 4], and Rényi DP (RDP) [26].
Differentially private stochastic gradient descent (DP-SGD) severely reduces the utility of the trained model compared with SGD. As shown in Fig. 1, the training and validation losses of logistic regression increase as the DP guarantee becomes stronger, and ResNet20 trained by DP-SGD has much lower testing accuracy than the non-private ResNet20 on CIFAR10. A natural question is:

Can we improve DP-SGD, with negligible extra computational complexity and memory cost, such that it can be used to train general ML models with better utility?

We answer the above question affirmatively by proposing differentially private Laplacian smoothing SGD (DP-LSSGD), which has both theoretical and empirical advantages over DP-SGD.
1.1 Our Contributions
The main contributions of our work are highlighted as follows:

We propose DP-LSSGD and prove its privacy and utility guarantees for convex/nonconvex optimization. We prove that under the same privacy budget, DP-LSSGD achieves better utility than DP-SGD for convex optimization, by a factor that is much less than one, excluding an intrinsic term that is usually dominated by the other terms.

We perform a large number of experiments on logistic regression, SVM, and ResNet to verify the utility improvement from using DP-LSSGD. These results show that DP-LSSGD remarkably reduces training and validation losses and improves the generalization of the trained private models.
In Table 1, we compare the privacy and utility guarantees of DP-LSSGD and DP-SGD. For the utility, the big-O notation hides the same constant and log factors for each bound. The constants d and n denote the dimension of the model's parameters and the number of training points, respectively. The utility bounds involve positive constants, two of which are strictly less than one; all are defined later.

Algorithm   DP Guarantee   Assumption   Utility   Measurement           Reference
DP-SGD      (ε, δ)         convex       see [3]   optimality gap        [3]
DP-SGD      (ε, δ)         nonconvex    see [43]  ℓ2 norm of gradient   [43]
DP-LSSGD    (ε, δ)         convex       Thm. 2    optimality gap        This Work
DP-LSSGD    (ε, δ)         nonconvex    Thm. 3    norm of gradient¹     This Work

¹ Measured in the induced norm; we will discuss this in detail in Section 4.
1.2 Additional Related Work
There is a massive volume of research over the past decade on designing algorithms for privacy-preserving ML. Objective perturbation, output perturbation, and gradient perturbation are the three major approaches to performing empirical risk minimization (ERM) with a DP guarantee. We discuss some related works here; there are many more exciting works that cannot be covered.
Chaudhuri et al. considered both output and objective perturbations for privacy-preserving ERM, and gave theoretical guarantees for both privacy and utility for logistic regression and SVM [5, 6]. Song et al. numerically studied the effects of learning rate and batch size in DP-ERM [34]. Wang et al. studied stability, learnability, and other properties of DP-ERM [39]. Lee et al. proposed an adaptive per-iteration privacy budget in concentrated DP gradient descent [24]. Variance reduction techniques, e.g., SVRG, have also been introduced to DP-ERM [37]. The utility bound of DP-SGD has also been analyzed for both convex and nonconvex smooth objectives [3, 43]. Jayaraman et al. analyzed the excess empirical risk of DP-ERM in the distributed setting [21]. Besides ERM, many other ML models have been made differentially private, including clustering [35, 41, 2], matrix completion [20], online learning [19], sparse learning [36, 38], and topic modeling [30]. Gilbert et al. exploited the ill-conditionedness of inverse problems to design algorithms that release differentially private measurements of a physical system [17].
Shokri et al. proposed distributed selective SGD to train deep neural nets (DNNs) with a DP guarantee in a distributed system; they achieved quite a successful tradeoff between privacy and utility [32]. Abadi et al. considered applying DP-SGD to train DNNs in a centralized setting; they clipped the gradient to bound the sensitivity and invented the moments accountant for better privacy loss estimation [1]. Papernot et al. proposed Private Aggregation of Teacher Ensembles (PATE), based on semi-supervised transfer learning, to train DNNs while protecting the privacy of the private data [28]. Recently, Papernot et al. introduced new noisy aggregation mechanisms for teacher ensembles that enable a tighter theoretical DP guarantee; the modified PATE scales to large datasets and applies to more diversified ML tasks [29]. Geyer et al. considered general ML with a DP guarantee in federated settings [16]. Rahman et al. numerically studied the vulnerability to adversarial attacks, and the privacy-utility tradeoff, of DNNs trained with a DP guarantee [31].
1.3 Notation
We use boldface uppercase letters A, B to denote matrices and boldface lowercase letters x, y to denote vectors. For vectors x, y and a positive definite matrix A, we use ‖x‖₂ and ‖x‖_A to denote the ℓ2 norm of x and the norm induced by A, respectively; ⟨x, y⟩ denotes the inner product of x and y; and λ_i(A) denotes the i-th largest eigenvalue of A. We denote the set of numbers from 1 to n by [n].
1.4 Organization
This paper is organized as follows: In Section 2, we introduce the DP-LSSGD algorithm, which merely injects an appropriate Gaussian noise to guarantee the privacy of LSSGD. In Section 3, we analyze the privacy and utility guarantees of DP-LSSGD for both convex and nonconvex optimization. We numerically verify the efficiency of DP-LSSGD in Section 4. We conclude and point out some future directions in Section 5.
2 Problem Setup and Algorithm
2.1 Laplacian Smoothing Stochastic Gradient Descent (LSSGD)
Consider the following finite-sum optimization problem

    min_w F(w) := (1/n) Σ_{i=1}^n f_i(w),    (1)

where f_i(w) is the loss of a given ML model on the i-th training datum. This finite-sum optimization problem is the mathematical formulation for training many of the ML models mentioned above. The LSSGD iteration [27] for solving this finite-sum optimization is

    w^{k+1} = w^k − η A_σ^{-1} ∇f_{i_k}(w^k),    (2)

where η is the learning rate and i_k is a random sample from [n]. Let A_σ := I − σL, where I and L are the identity matrix and the discrete one-dimensional Laplacian matrix with periodic boundary conditions, respectively. Therefore,

    A_σ = [ 1+2σ   −σ     0    ⋯    0    −σ
             −σ   1+2σ   −σ    ⋯    0     0
              ⋮     ⋮     ⋱     ⋱    ⋮     ⋮
             −σ     0     0    ⋯   −σ   1+2σ ],    (3)

for σ ≥ 0 a constant. When σ = 0, LSSGD reduces to SGD.
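For concreteness, the smoothing matrix just described (the identity minus σ times the periodic one-dimensional Laplacian; we write it as A_sigma, and the construction below is our own sketch, not the paper's reference code) can be built and sanity-checked as follows:

```python
import numpy as np

def laplacian_smoothing_matrix(d, sigma):
    """Return I - sigma * L, where L is the periodic 1-D discrete Laplacian."""
    L = np.zeros((d, d))
    for i in range(d):
        L[i, i] = -2.0
        L[i, (i - 1) % d] = 1.0
        L[i, (i + 1) % d] = 1.0
    return np.eye(d) - sigma * L

A = laplacian_smoothing_matrix(8, sigma=1.0)
# sigma = 0 recovers the identity, so the iteration reduces to plain SGD.
assert np.allclose(laplacian_smoothing_matrix(8, 0.0), np.eye(8))
# The matrix is symmetric positive definite, hence always invertible.
assert np.allclose(A, A.T) and np.all(np.linalg.eigvalsh(A) > 0)
```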
This Laplacian smoothing can help to avoid spurious minima, reduce the variance of SGD on the fly, and lead to better generalization in training many ML models, including DNNs. Computationally, we use the fast Fourier transform (FFT) to perform gradient smoothing as follows:

    A_σ^{-1} g = ifft( fft(g) / (1 − σ · fft(v)) ),

where g is any stochastic gradient vector and v = (−2, 1, 0, …, 0, 1)ᵀ is the convolution stencil of the discrete Laplacian L.
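A sketch of this FFT-based smoothing (our own implementation of the formula above, with `v` the Laplacian stencil; not the paper's reference code) is:

```python
import numpy as np

def smooth_gradient(g, sigma):
    """Apply A_sigma^{-1} to g via FFT: circulant systems diagonalize in Fourier space."""
    d = len(g)
    v = np.zeros(d)
    v[0], v[1], v[-1] = -2.0, 1.0, 1.0      # stencil of the periodic 1-D Laplacian
    denom = 1.0 - sigma * np.fft.fft(v)     # eigenvalues of A_sigma
    return np.real(np.fft.ifft(np.fft.fft(g) / denom))
```

This costs O(d log d) per step and needs no matrix storage, which is why the overhead relative to DP-SGD is negligible.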
2.2 DP-LSSGD
We propose the following DP-LSSGD iteration to solve the finite-sum optimization in Eq. (1):

    w^{k+1} = w^k − η A_σ^{-1} ( ∇f_{i_k}(w^k) + n ),    (4)

where ∇f_{i_k}(w^k) denotes the stochastic gradient of the total loss function evaluated on a sample drawn from the database and n is the injected Gaussian noise. In this scheme, we first add the noise n to the stochastic gradient vector ∇f_{i_k}(w^k), and then apply the operator A_σ^{-1} to smooth the noisy stochastic gradient on the fly. We assume that each component function f_i in Eq. (1) is G-Lipschitz. The DP-LSSGD algorithm for finite-sum optimization is summarized in Algorithm 1.
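One DP-LSSGD step can be sketched as follows. This is our own minimal illustration: the function name, the per-example clipping, and the averaging over an explicit minibatch are assumptions of this sketch, not the paper's reference implementation.

```python
import numpy as np

def dp_lssgd_step(w, grad_fn, batch, eta, clip, noise_std, sigma, rng):
    """One DP-LSSGD update: clip per-example gradients, add Gaussian noise,
    apply Laplacian smoothing via FFT, then take a gradient step (cf. Eq. (4))."""
    d = len(w)
    g = np.zeros(d)
    for x in batch:
        gi = grad_fn(w, x)
        gi = gi / max(1.0, np.linalg.norm(gi) / clip)  # clipping bounds sensitivity
        g += gi
    g = g / len(batch) + rng.normal(0.0, noise_std, size=d)  # Gaussian mechanism
    v = np.zeros(d)
    v[0], v[1], v[-1] = -2.0, 1.0, 1.0
    g = np.real(np.fft.ifft(np.fft.fft(g) / (1.0 - sigma * np.fft.fft(v))))
    return w - eta * g
```

Since A_σ^{-1} is symmetric positive definite, the smoothed noisy gradient remains a descent direction in expectation, so the iteration still makes progress on the objective.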
3 Main Theory
In this section, we present the privacy and utility guarantees for DPLSSGD. The technical proofs are provided in the appendix.
Definition 1 (DP).
([11]) A randomized mechanism M : D → R satisfies (ε, δ)-differential privacy if for any two adjacent datasets D, D′ ∈ D differing by one element, and any output subset S ⊆ R, it holds that

    P[ M(D) ∈ S ] ≤ e^ε · P[ M(D′) ∈ S ] + δ.
Theorem 1 (Privacy Guarantee).
Suppose that each component function f_i is G-Lipschitz. Given the total number of iterations T, for any δ > 0 and privacy budget ε, DP-LSSGD, with injected Gaussian noise N(0, ν²) in each coordinate, satisfies (ε, δ)-differential privacy for the noise variance ν² specified in the proof in the appendix; this variance grows with T and shrinks as ε or the number of training points n grows.
Remark 1.
The noise level in Theorem 1 is in fact also tight for DP-SGD: the same amount of Gaussian noise guarantees the same differential privacy for both DP-SGD and DP-LSSGD.
For convex ERM, DP-LSSGD guarantees the following utility bound, in terms of the gap between the ergodic average of the iterates along the DP-LSSGD path and the optimal solution w*.
Theorem 2 (Utility Guarantee for convex optimization).
Suppose F is convex and each component function f_i is G-Lipschitz. Given any ε, δ > 0, for appropriate choices of the step size η and the number of iterations T (given in the appendix, together with the definition of the ergodic average w̃), where w* is the global minimizer of F, the DP-LSSGD output w̃ satisfies a utility bound, measured by the optimality gap E F(w̃) − F(w*), that improves the corresponding DP-SGD bound by a factor strictly less than one, up to an intrinsic additive term.
Proposition 1.
The contraction factor appearing in the utility bound of Theorem 2 is strictly between 0 and 1; its closed form, in terms of σ and the dimension d, is given in the appendix.
Remark 2.
Compared with the extra utility bound of DP-SGD, DP-LSSGD has a strictly better extra utility bound, by a factor that is much less than one, except for one intrinsic term. In practice, for both logistic regression and SVM, this intrinsic term is dominated by the other terms, and DP-LSSGD improves the utility of both models.
For nonconvex ERM, DP-LSSGD has the following utility bound, measured in gradient norm.
Theorem 3 (Utility Guarantee for nonconvex optimization).
Suppose that F is nonconvex and each component function f_i is G-Lipschitz and has Lipschitz continuous gradient. Given any ε, δ > 0, for appropriate choices of the step size η and the number of iterations T (given in the appendix), with F* being the global minimum of F, the DP-LSSGD output satisfies a utility bound measured by the norm of the gradient of F, with a contraction factor analogous to the convex case.
Proposition 2.
The contraction factor appearing in the utility bound of Theorem 3 is likewise strictly between 0 and 1; its closed form is given in the appendix.

It is worth noting that if we use the ℓ2 norm instead of the induced norm, DP-LSSGD has a bigger utility upper bound than DP-SGD (obtained by setting σ = 0). However, this does not mean that DP-LSSGD has worse performance. To see this point, consider a simple one-dimensional nonconvex function whose gradient magnitude varies rapidly between local minima: one can pick two points a and b such that a is closer to a local minimizer than b, while the gradient at a has the larger norm. This shows that the gradient norm is not the optimal measure for comparing utility bounds in nonconvex optimization. We will further verify this in Section 4.
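A toy function of our own choosing (not necessarily the example used above) makes the phenomenon concrete: for f(x) = 1 − cos(2x), which has local minima at x = 0 and x = π, the point x = 0.6 is closer to the minimizer 0 than x = 1.2, yet has the larger gradient magnitude:

```python
import numpy as np

# f(x) = 1 - cos(2x) has local minima at x = 0 and x = pi;
# its derivative f'(x) = 2 sin(2x) peaks between them, at x = pi/4.
f_prime = lambda x: 2.0 * np.sin(2.0 * x)

a, b = 0.6, 1.2      # both lie in the basin of the minimizer x* = 0
assert abs(a - 0.0) < abs(b - 0.0)            # a is closer to the minimizer...
assert abs(f_prime(a)) > abs(f_prime(b))      # ...yet has the larger gradient norm
```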
4 Numerical Results
In this section, we verify the efficiency of the proposed DP-LSSGD in training multiclass logistic regression, SVM, and ResNet20. We perform the ResNet20 experiments on the CIFAR10 dataset with standard data augmentation [18], and the logistic regression and SVM experiments on the benchmark MNIST classification task. Based on the range of gradient values of each model, we use the clipping scheme of [1] to bound the gradient norms of logistic regression, ResNet20, and SVM; this gradient clipping guarantees the Lipschitz condition for the objective functions. For all experiments, we train logistic regression, SVM, and ResNet20 with (ε, δ)-DP guarantees. We regard DP-SGD as the benchmark.
4.1 Multiclass Logistic Regression and SVM
For MNIST classification, we ran 50 epochs of DP-LSSGD with a decaying learning rate schedule, with k being the index of the iteration, to train ℓ2-regularized multiclass logistic regression and SVM (the objective functions of both models are strongly convex). We split the training data into 50K/10K for cross-validation. The models with the best validation accuracy are used for testing. A fixed batch size is used throughout.
First, we show that DP-LSSGD converges faster than DP-SGD and attains much smaller training and validation losses. We plot the evolution of the training and validation losses over iterations for logistic regression (Fig. 2) and SVM (Fig. 3) with DP guarantees. Figures 2 and 3 show that the training loss curve of DP-SGD (σ = 0) is much higher and more oscillatory (due to the log scale on the axis) than that of DP-LSSGD (σ > 0). The validation losses of logistic regression and SVM trained by both DP-SGD and DP-LSSGD decrease as the iterations proceed, but the validation loss of the model trained by DP-LSSGD decays faster and reaches a much smaller value. For both training and validation, DP-LSSGD with a larger σ gives better results.
[Figure 2: training and validation losses of logistic regression, panels (a)–(d).]
[Figure 3: training and validation losses of SVM, panels (a)–(d).]
Table 2: Testing accuracy (%) of multiclass logistic regression on MNIST for decreasing privacy budgets ε.

ε                 0.50   0.45   0.40   0.35   0.30   0.25   0.20
DP-SGD (σ = 0)    81.59  81.52  80.07  79.30  78.71  77.80  76.02
DP-LSSGD (σ = 1)  83.64  83.70  82.91  82.33  82.25  79.53  78.01
DP-LSSGD (σ = 2)  84.41  83.45  81.88  83.06  81.39  79.03  78.86
DP-LSSGD (σ = 3)  84.14  83.99  82.17  82.08  81.74  80.90  80.21
Table 3: Testing accuracy (%) of SVM on MNIST for decreasing privacy budgets ε.

ε                 0.50   0.45   0.40   0.35   0.30   0.25   0.20
DP-SGD (σ = 0)    78.28  77.41  76.07  74.09  72.98  72.47  70.25
DP-LSSGD (σ = 1)  80.53  79.53  77.77  77.09  75.37  75.89  72.94
DP-LSSGD (σ = 2)  81.72  79.69  79.59  77.99  77.09  76.19  73.94
DP-LSSGD (σ = 3)  80.57  80.11  78.85  77.44  76.92  75.97  73.97
Second, consider the validation accuracy of the models trained by DP-SGD and DP-LSSGD. Figure 4 depicts the evolution of the validation accuracy of logistic regression and SVM trained by DP-SGD and DP-LSSGD; we plot the validation accuracy after every training epoch. It shows that DP-LSSGD almost always gives better validation accuracy than DP-SGD. Different values of σ in DP-LSSGD give different levels of improvement; in these experiments, a larger σ is usually better than a smaller one.
Third, consider the testing accuracy of logistic regression and SVM trained in different scenarios. The corresponding testing accuracies are listed in Tables 2 and 3. All numbers reported in these and the following tables are averaged over three independent experiments. These results reveal that the multiclass logistic regression model is remarkably more accurate than SVM at various levels of DP guarantee. Both logistic regression and SVM trained by DP-LSSGD (σ > 0) are more accurate than those trained by DP-SGD across the different privacy levels.
[Figure 4: validation accuracy of logistic regression and SVM, panels (a)–(d).]
Table 4: Testing accuracy (%) of ResNet20 on CIFAR10 for decreasing privacy budgets ε.

ε                 4.0    3.5    3.0    2.5    2.0    1.5    1.0    0.5
DP-SGD (σ = 0)    70.08  67.41  65.19  61.13  56.27  51.41  37.92  25.12
DP-LSSGD (σ = 1)  72.20  71.25  68.42  65.32  62.70  58.32  45.05  31.35
DP-LSSGD (σ = 2)  72.06  70.66  68.97  65.59  61.30  58.62  46.28  32.11
DP-LSSGD (σ = 3)  73.61  70.06  68.33  66.96  60.77  57.37  45.14  32.07
4.1.1 The Choice of σ
Table 5 lists the testing accuracy (averaged over three runs) of private logistic regression and SVM trained by DP-LSSGD with different σ. It shows that the accuracy improvement is robust to the choice of σ: as σ increases, the testing accuracy first increases and then decays. In practice, DP-LSSGD is as fast as DP-SGD, so for a given objective function we can try a few different values of σ to find a good one.
Table 5: Testing accuracy (%) of DP-LSSGD for different Laplacian smoothing parameters σ.

σ                              0      2      4      6      8      10     12     15
Logistic regression (ε = 0.5)  81.59  84.41  84.17  84.15  85.20  83.71  83.63  83.26
Logistic regression (ε = 0.3)  78.71  81.39  80.97  82.75  82.02  81.01  80.94  80.89
SVM (ε = 0.5)                  78.28  81.72  80.97  81.11  81.67  81.35  80.80  80.56
SVM (ε = 0.3)                  72.98  77.09  77.18  77.02  77.54  77.01  76.05  75.82
4.2 Deep Learning
We run 100 epochs of DP-LSSGD with batch size 128 to train ResNet20 on CIFAR10. To keep the setting consistent with our theoretical results, we apply DP-LSSGD without momentum, and no weight decay is used during training. It is known that Nesterov momentum and weight decay, i.e., ℓ2 regularization, help accelerate convergence and improve the generalization of the trained model; in future work, we will integrate these techniques into DP-LSSGD. We split the training data into 45K/5K for cross-validation. During training, we decay the learning rate by a factor of 10 at the 40th and 80th epochs. Figure 5 shows the training loss (Fig. 5(a)) and validation loss (Fig. 5(b)) per epoch of ResNet20 trained by DP-LSSGD with different Laplacian smoothing parameters σ under a DP guarantee. We conclude from these two plots that: (i) learning rate decay, well known to help SGD, is still very helpful for DP-LSSGD in training DNNs, as there is a sharp drop in the training and validation losses at the 40th epoch; (ii) Laplacian smoothing reduces both the training and validation losses significantly.
We plot validation accuracy versus epoch in Fig. 5(c), which is generally consistent with the validation loss curves. In Fig. 5(d), we plot the testing accuracy of the models trained by DP-LSSGD with different Laplacian smoothing parameters σ and different privacy budgets ε with δ fixed; the corresponding testing accuracies are listed in Table 4. DP-LSSGD improves testing accuracy substantially when strong DP is guaranteed, and the accuracy improvement is much more significant than in the convex optimization scenario.
[Figure 5: training loss, validation loss, validation accuracy, and testing accuracy of ResNet20, panels (a)–(d).]
DP-LSSGD is a complement to the privacy mechanisms proposed in [1] and [29]. In future work, we will integrate DP-LSSGD into the algorithms proposed in [1] and [29] to further boost the private models' utility.
4.2.1 Is Gradient Norm the Right Metric for Measuring Utility?
In Section 3, we gave a simple nonconvex example showing that a point with a smaller gradient norm is not necessarily closer to the local minima. Here, we show experimentally that the same holds for ResNet20. Figure 6 depicts, per epoch k, the gradient norm (a), the validation accuracy (b), and the training (c) and validation (d) losses. These plots show that although DP-LSSGD has a larger gradient norm than DP-SGD during training, it has much better utility in terms of validation accuracy and training and validation losses.
[Figure 6: gradient norm, validation accuracy, and training/validation losses of ResNet20, panels (a)–(d).]
5 Conclusions
In this paper, we proposed DP-LSSGD, a new differentially private stochastic optimization algorithm inspired by the recently proposed LSSGD. The algorithm is simple to implement, and its extra computational cost compared with DP-SGD is almost negligible. We showed both theoretically and numerically that DP-LSSGD can lift the utility of trained private ML models. It is straightforward to combine LS with other variance reduction techniques, e.g., SVRG [22].
Appendix A Proof of the Main Theorems
A.1 Privacy Guarantee
To prove the privacy guarantee in Theorem 1, we first introduce the notion of sensitivity.

Definition 2 (Sensitivity).
For any given function h(·), the ℓ2 sensitivity of h is defined by

    Δ₂(h) = sup_{D ∼ D′} ‖h(D) − h(D′)‖₂,

where D ∼ D′ means the datasets D and D′ differ in only one entry.
We will adapt the concepts and techniques of Rényi differential privacy (RDP) to prove the DP guarantee of the proposed DP-LSSGD.
Definition 3 (RDP).
For α ∈ (1, ∞) and ε > 0, a randomized mechanism M satisfies (α, ε)-Rényi differential privacy, i.e., (α, ε)-RDP, if for all adjacent datasets D, D′ differing by one element, we have

    D_α( M(D) ‖ M(D′) ) := (1/(α − 1)) · log E[ ( P[M(D) = x] / P[M(D′) = x] )^α ] ≤ ε,

where the expectation is taken over M(D′).
Lemma 1.
[40] Given a function h, the Gaussian mechanism M = h(D) + n, where n ∼ N(0, ν²I), satisfies (α, α Δ₂(h)²/(2ν²))-RDP. In addition, if we apply the mechanism to a subset of samples using uniform sampling without replacement, the subsampled mechanism satisfies (α, O(τ² α Δ₂(h)²/ν²))-RDP when ν² is sufficiently large, with τ denoting the subsample rate.
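The closed-form Rényi divergence behind the Gaussian-mechanism part of Lemma 1 can be checked numerically (a self-contained sketch; the quadrature grid and function names are ours):

```python
import numpy as np

def renyi_gaussian(alpha, delta_f, sigma):
    """Closed form: D_alpha(N(0, s^2) || N(d, s^2)) = alpha * d^2 / (2 s^2)."""
    return alpha * delta_f**2 / (2.0 * sigma**2)

def renyi_numeric(alpha, delta_f, sigma):
    """Quadrature check of D_alpha = log(int p^alpha q^(1-alpha)) / (alpha - 1)."""
    x = np.linspace(-40.0, 40.0, 400001)
    p = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    q = np.exp(-(x - delta_f)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    y = p**alpha * q**(1.0 - alpha)
    return np.log(np.sum(0.5 * (y[1:] + y[:-1])) * (x[1] - x[0])) / (alpha - 1.0)
```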
Lemma 2.
[26] If randomized mechanisms M_i, for i ∈ [k], satisfy (α, ε_i)-RDP, then their composition (M_1, …, M_k) satisfies (α, Σ_{i=1}^k ε_i)-RDP. Moreover, the input of the i-th mechanism can be based on the outputs of the previous i − 1 mechanisms.
Lemma 3.
If a randomized mechanism M satisfies (α, ε)-RDP, then M satisfies (ε + log(1/δ)/(α − 1), δ)-DP for all δ ∈ (0, 1).
With the definition of RDP (Definition 3), its guarantees (Lemmas 1 and 2), and the connection between RDP and DP (Lemma 3), we can prove the DP guarantee of DP-LSSGD.
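The chain Lemma 1 → Lemma 2 → Lemma 3 amounts to a simple accounting computation. The sketch below composes T Gaussian-mechanism RDP bounds and optimizes the conversion over the order α; it is our simplification in that it omits the subsampling amplification of Lemma 1:

```python
import numpy as np

def rdp_to_dp_epsilon(T, delta_f, nu, delta, alphas=np.arange(2, 200)):
    """(epsilon, delta)-DP of T composed Gaussian mechanisms via RDP accounting.

    Per step (Lemma 1): (alpha, alpha * delta_f^2 / (2 nu^2))-RDP.
    Composition (Lemma 2) sums the RDP epsilons; Lemma 3 converts to DP,
    and we minimize the resulting epsilon over the order alpha.
    """
    eps_rdp = T * alphas * delta_f**2 / (2.0 * nu**2)
    return float(np.min(eps_rdp + np.log(1.0 / delta) / (alphas - 1.0)))
```

Increasing the noise standard deviation nu shrinks every candidate RDP epsilon, so the final DP epsilon drops accordingly.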
Proof of Theorem 1.
Let us denote the updates of DP-SGD and DP-LSSGD at the k-th iteration, starting from given points w^k and w̃^k, respectively, as

    w^{k+1} = w^k − η ( ∇f_{i_k}(w^k) + n^k ),    (6)

and

    w̃^{k+1} = w̃^k − η A_σ^{-1} ( ∇f_{i_k}(w̃^k) + n^k ),    (7)

where i_k is drawn uniformly from [n] and n^k is the injected Gaussian noise.

We will show that with the aforementioned Gaussian noise N(0, ν²) in each coordinate of n^k, the output of DP-SGD after T iterations is (ε, δ)-DP. Consider the mechanism M_k with query h_k = ∇f_{i_k}(w^k); by the G-Lipschitz assumption, its ℓ2 sensitivity is bounded. According to Lemma 1, if we add Gaussian noise with a suitable variance ν², the mechanism M_k satisfies RDP of the corresponding order, and by the subsampling amplification in Lemma 1 the privacy parameter improves with the subsample rate. By the post-processing theorem, under the same noise, A_σ^{-1} ∘ M_k also satisfies the same RDP guarantee. Therefore, according to Lemma 2, the composition (M_1, …, M_T) satisfies RDP, and by Lemma 3 it satisfies (ε, δ)-DP for an appropriate choice of α. Hence the outputs of both DP-SGD and DP-LSSGD after T iterations are (ε, δ)-DP. ∎
Remark 3.
The above proof used a conservative estimate of the sensitivity of the smoothed query. According to [27], applying A_σ^{-1} can only shrink the ℓ2 norm of a gradient difference, with a compression ratio that depends on σ and the dimension d of the vector. Moreover, if the gradient difference is randomly sampled from a unit ball in a high-dimensional space, a high-probability estimate of this compression ratio can be derived from Lemma 5. Numerical experiments show that the actual ratio is much less than one, so for the above noise level, DP-LSSGD can in fact enjoy a much stronger privacy guarantee.
A.2 Utility Guarantee – Convex Optimization
To prove the utility guarantee for convex optimization, we first show that the Laplacian smoothing operator compresses the induced norm of a Gaussian random vector by a specific ratio in expectation.

Lemma 4.
Let n ∼ N(0, I_{d×d}) be the standard Gaussian random vector. Then, for σ > 0,

    E ‖A_σ^{-1} n‖²_{A_σ} = tr(A_σ^{-1}) = Σ_{i=1}^d 1/Λ_ii < d,

where ‖x‖²_{A_σ} = xᵀ A_σ x is the square of the norm induced by the matrix A_σ, and Λ_ii are the eigenvalues of A_σ.

Proof of Lemma 4.
Let the eigenvalue decomposition of A_σ be U Λ Uᵀ, where Λ is a diagonal matrix with Λ_ii = 1 + 2σ − 2σ cos(2πi/d). We have

    E ‖A_σ^{-1} n‖²_{A_σ} = E[ nᵀ A_σ^{-1} A_σ A_σ^{-1} n ] = E[ nᵀ A_σ^{-1} n ] = tr(A_σ^{-1}) = Σ_{i=1}^d 1/Λ_ii,

using E[n nᵀ] = I. Since Λ_ii ≥ 1, with strict inequality whenever σ > 0 and cos(2πi/d) < 1, the sum is strictly less than d. ∎
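The eigenvalues used in this proof can be verified numerically: a symmetric circulant matrix is diagonalized by the discrete Fourier basis, so the spectrum of the dense smoothing matrix should match the closed form 1 + 2σ − 2σ cos(2πi/d). The script below is our own check, not part of the paper's code:

```python
import numpy as np

d, sigma = 16, 1.0
# Dense A_sigma = I - sigma * L with the periodic 1-D Laplacian L.
L = np.zeros((d, d))
for i in range(d):
    L[i, i] = -2.0
    L[i, (i - 1) % d] = 1.0
    L[i, (i + 1) % d] = 1.0
A = np.eye(d) - sigma * L
closed_form = 1.0 + 2.0 * sigma - 2.0 * sigma * np.cos(2.0 * np.pi * np.arange(d) / d)
assert np.allclose(np.sort(np.linalg.eigvalsh(A)), np.sort(closed_form))
# Consistent with Lemma 4: the trace of A_sigma^{-1} is strictly less than d.
assert np.sum(1.0 / closed_form) < d
```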
Proof of Theorem 2.
Recall the update rule w^{k+1} = w^k − η A_σ^{-1}( ∇f_{i_k}(w^k) + n ), where i_k is drawn uniformly from [n] and n is the injected Gaussian noise. Expanding ‖w^{k+1} − w*‖²_{A_σ} around the update and taking the expectation with respect to i_k and n conditioned on w^k, the cross term is bounded via the convexity of F and the noise term via Lemma 4. Taking the full expectation and summing over the T iterations with a constant step size η, then applying the definition of the ergodic average w̃ and the convexity of F (Jensen's inequality), yields the claimed utility bound for the choices of η and T in the theorem statement. ∎
A.3 Utility Guarantee – Nonconvex Optimization
To prove the utility guarantee for nonconvex optimization, we need the following lemma, which shows that the Laplacian smoothing operator compresses the ℓ2 norm of a Gaussian random vector by a specific ratio in expectation.

Lemma 5.
Let n ∼ N(0, I_{d×d}) be the standard Gaussian random vector. Then, for σ > 0,

    E ‖A_σ^{-1} n‖²₂ = tr(A_σ^{-2}) = Σ_{i=1}^d 1/Λ²_ii < d.

Proof of Lemma 5.
Let the eigenvalue decomposition of A_σ be U Λ Uᵀ, where Λ is a diagonal matrix with Λ_ii = 1 + 2σ − 2σ cos(2πi/d). We have

    E ‖A_σ^{-1} n‖²₂ = E[ nᵀ A_σ^{-2} n ] = tr(A_σ^{-2}) = Σ_{i=1}^d 1/Λ²_ii,

and the bound follows since Λ_ii ≥ 1, with strict inequality for σ > 0 whenever cos(2πi/d) < 1. ∎
Proof of Theorem 3.
Recall the update rule w^{k+1} = w^k − η A_σ^{-1}( ∇f_{i_k}(w^k) + n ), where i_k is drawn uniformly from [n] and n is the injected Gaussian noise. Since F has Lipschitz continuous gradient, the descent lemma applies to the update. Taking the expectation with respect to i_k and n conditioned on w^k, the noise term is bounded via Lemma 5 and the step-size condition. Taking the full expectation, summing over the T iterations with a fixed step size η, rearranging, and using the lower bound F* of F, yields the claimed bound on the gradient norm for the choices of η and T in the theorem statement. ∎