
# DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM

Bao Wang
Department of Mathematics
University of California, Los Angeles
wangbaonj@gmail.com
Quanquan Gu
Department of Computer Science
University of California, Los Angeles
qgu@cs.ucla.edu
March Boedihardjo
Department of Mathematics
University of California, Los Angeles
march@math.ucla.edu
Farzin Barekat
Department of Mathematics
University of California, Los Angeles
fbarekat@math.ucla.edu
Stanley J. Osher
Department of Mathematics
University of California, Los Angeles
sjo@math.ucla.edu
###### Abstract

Machine learning (ML) models trained by differentially private stochastic gradient descent (DP-SGD) have much lower utility than non-private ones. To mitigate this degradation, we propose a DP Laplacian smoothing SGD (DP-LSSGD) for privacy-preserving ML. At the core of DP-LSSGD is the Laplacian smoothing operator, which smooths out the Gaussian noise vector used in the Gaussian mechanism. Under the same amount of noise used in the Gaussian mechanism, DP-LSSGD attains the same differential privacy guarantee but, for convex optimization, a strictly better utility guarantee than DP-SGD, improved by a factor that is much less than one, excluding an intrinsic term that is usually dominated by the other terms. In practice, DP-LSSGD makes training both convex and nonconvex ML models more efficient and enables the trained models to generalize better. For ResNet20, under the same strong differential privacy guarantee, DP-LSSGD can significantly lift the testing accuracy of the trained private model compared with DP-SGD. The proposed algorithm is simple to implement, and the extra computational complexity and memory overhead compared with DP-SGD are negligible. DP-LSSGD is applicable to training a large variety of ML models, including deep neural nets. The code is available at https://github.com/BaoWangMath/DP-LSSGD.

## 1 Introduction

Many released machine learning (ML) models are trained on sensitive data that are often crowdsourced or contain personal private information [42, 14, 25]. With a large number of parameters, deep neural nets (DNNs) can memorize the sensitive training data, and it is possible to recover the sensitive data and break the privacy by attacking the released models [33]. For example, Fredrikson et al. demonstrated that a model-inversion attack can recover training images from a facial recognition system [15]. Protecting the privacy of sensitive training data is one of the most critical tasks in ML.

Differential privacy (DP) [11, 10] is a theoretically rigorous tool for designing algorithms on aggregated databases with a privacy guarantee. The basic idea is to add a certain amount of noise to randomize the output of a given algorithm such that attackers cannot distinguish outputs of any two adjacent input datasets that differ in only one entry. Two types of noise are typically injected into the algorithm for a DP guarantee: Laplace noise and Gaussian noise [11].
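As a concrete illustration of such additive-noise mechanisms, the following sketch adds Gaussian noise calibrated to a query's $\ell_2$-sensitivity. The function name and the example query are ours, and the calibration is the classical one (valid for $\epsilon \leq 1$), not the tighter accounting used later in this paper.

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng):
    """Release `value` with (eps, delta)-DP by adding N(0, sigma^2) noise,
    with sigma calibrated to the query's l2-sensitivity (classical
    calibration, valid for eps <= 1)."""
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    noisy = value + rng.normal(0.0, sigma, size=np.shape(value))
    return noisy, sigma

rng = np.random.default_rng(0)
# Example query: an average of n records, each bounded by 1 in l2-norm,
# so replacing one record moves the average by at most 2/n.
n = 1000
noisy_avg, sigma = gaussian_mechanism(0.5, 2.0 / n, eps=1.0, delta=1e-5, rng=rng)
```

Stronger privacy (smaller $\epsilon$ or $\delta$) requires a larger noise scale, which is exactly the utility degradation this paper targets.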

For repeated applications of additive-noise-based mechanisms, many tools have been invented to analyze the DP guarantee for the model obtained at the final stage. These include the basic composition theorem [9, 8], the strong composition theorem and their refinements [13, 23], the moments accountant [1], etc. Beyond the original definition of DP, there are also many other ways to define privacy, e.g., local DP [7], concentrated/zero-concentrated DP [12, 4], and Rényi-DP (RDP) [26].

Differentially private stochastic gradient descent (DP-SGD) reduces the utility of the trained model severely compared with SGD. As shown in Fig. 1, the training and validation losses of logistic regression increase when the DP guarantee becomes stronger. ResNet20 trained by DP-SGD has much lower testing accuracy than the non-private ResNet20 on CIFAR10. A natural question is:

Can we improve DP-SGD, with negligible extra computational complexity and memory cost, such that it can be used to train general ML models with better utility?

We answer the above question affirmatively by proposing differentially private Laplacian smoothing SGD (DP-LSSGD). It gives both theoretical and empirical advantages compared with DP-SGD.

### 1.1 Our Contributions

The main contributions of our work are highlighted as follows:

• We propose DP-LSSGD and prove its privacy and utility guarantees for convex/nonconvex optimization. We prove that under the same privacy budget, DP-LSSGD achieves better utility than DP-SGD, excluding an intrinsic term that is usually dominated by the other terms, by a factor that is much less than one for convex optimization.

• We perform a large number of experiments on logistic regression, SVM, and ResNet to verify the utility improvement by using DP-LSSGD. These results show that DP-LSSGD remarkably reduces training and validation loss and improves the generalization of the trained private models.

In Table 1, we compare the privacy and utility guarantees of DP-LSSGD and DP-SGD. For the utility, the notation $\tilde{\mathcal{O}}$ hides the same constant and log factors for each bound. The constants $d$ and $n$ denote the dimension of the model's parameters and the number of training points, respectively. The numbers $\gamma$ and $\beta$ are positive constants that are strictly less than one, and $D_\sigma$ and $D_F$ are positive constants, which will be defined later.

### 1.2 Related Work

There is a massive volume of research over the past decade on designing algorithms for privacy-preserving ML. Objective perturbation, output perturbation, and gradient perturbation are the three major approaches to performing empirical risk minimization (ERM) with a DP guarantee. We discuss some related works in this part; there are many more exciting works that cannot be discussed here.

Chaudhuri et al. considered both output and objective perturbations for privacy-preserving ERM, and gave theoretical guarantees for both privacy and utility for logistic regression and SVM [5, 6]. Song et al. numerically studied the effects of learning rate and batch size in DP-ERM [34]. Wang et al. studied stability, learnability and other properties of DP-ERM [39]. Lee et al. proposed an adaptive per-iteration privacy budget in concentrated DP gradient descent [24]. Variance reduction techniques, e.g., SVRG, have also been introduced to DP-ERM [37]. The utility bound of DP-SGD has also been analyzed for both convex and nonconvex smooth objectives [3, 43]. Jayaraman et al. analyzed the excess empirical risk of DP-ERM under the distributed setting [21]. Besides ERM, many other ML models have been made differentially private. These include: clustering [35, 41, 2], matrix completion [20], online learning [19], sparse learning [36, 38], and topic modeling [30]. Gilbert et al. exploited the ill-conditionedness of inverse problems to design algorithms to release differentially private measurements of the physical system [17].

Shokri et al. proposed distributed selective SGD to train deep neural nets (DNNs) with a DP guarantee in a distributed system; they achieved quite a successful trade-off between privacy and utility [32]. Abadi et al. considered applying DP-SGD to train DNNs in a centralized setting. They clipped the gradient to bound the sensitivity and invented the moments accountant to get better privacy loss estimation [1]. Papernot et al. proposed Private Aggregation of Teacher Ensembles (PATE), based on semi-supervised transfer learning, to train DNNs and to protect the privacy of the private data [28]. Recently, Papernot et al. introduced new noisy aggregation mechanisms for teacher ensembles that enable a tighter theoretical DP guarantee. The modified PATE is scalable to large datasets and applicable to more diversified ML tasks [29]. Geyer et al. considered general ML with a DP guarantee under federated settings [16]. Rahman et al. numerically studied the vulnerability to adversarial attacks, and the privacy-utility trade-off, of DNNs trained with a DP guarantee [31].

### 1.3 Notation

We use boldface upper-case letters $A$, $B$ to denote matrices and boldface lower-case letters $x$, $y$ to denote vectors. For vectors $x$, $y$ and a positive definite matrix $A$, we use $\|x\|_2$ and $\|x\|_A := \sqrt{x^\top A x}$ to denote the $\ell_2$-norm and the induced norm by $A$, respectively; $\langle x, y\rangle$ denotes the inner product of $x$ and $y$; and $\lambda_i(A)$ denotes the $i$-th largest eigenvalue of $A$. We denote the set of numbers from $1$ to $n$ by $[n]$.

### 1.4 Organization

This paper is organized in the following way: In Section 2, we introduce the DP-LSSGD algorithm, which merely injects an appropriate Gaussian noise to guarantee the privacy of LSSGD. In Section 3, we analyze the privacy and utility guarantees of DP-LSSGD for both convex and nonconvex optimizations. We numerically verify the efficiency of DP-LSSGD in Section 4. We conclude this work and point out some future directions in Section 5.

## 2 Problem Setup and Algorithm

### 2.1 Laplacian Smoothing Stochastic Gradient Descent (LSSGD)

Consider the following finite-sum optimization

$$\min_{w} F(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w), \tag{1}$$

where $f_i(w)$ is the loss of a given ML model on the $i$-th training example. This finite-sum optimization problem is the mathematical formulation for training many of the ML models mentioned above. The LSSGD [27] for solving this finite-sum optimization is given by

$$w^{k+1} = w^k - \eta A_\sigma^{-1}\nabla f_{i_k}(w^k), \tag{2}$$

where $\eta$ is the learning rate, and $i_k$ is a random sample from $[n]$. Let $A_\sigma := I - \sigma L$, where $I$ and $L$ are the identity matrix and the discrete one-dimensional Laplacian matrix with periodic boundary condition, respectively. Therefore,

$$A_\sigma := \begin{bmatrix}
1+2\sigma & -\sigma & 0 & \cdots & 0 & -\sigma\\
-\sigma & 1+2\sigma & -\sigma & \cdots & 0 & 0\\
0 & -\sigma & 1+2\sigma & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
-\sigma & 0 & 0 & \cdots & -\sigma & 1+2\sigma
\end{bmatrix} \tag{3}$$

for $\sigma \geq 0$ being a constant. When $\sigma = 0$, LSSGD reduces to SGD.

This Laplacian smoothing can help to avoid spurious minima, reduce the variance of SGD on-the-fly, and lead to better generalization in training many ML models including DNNs. Computationally, we use the fast Fourier transform (FFT) to perform gradient smoothing in the following way

$$A_\sigma^{-1} v = \mathrm{ifft}\!\left(\frac{\mathrm{fft}(v)}{1 - \sigma\cdot \mathrm{fft}(d)}\right),$$

where $v$ is any stochastic gradient vector and $d = [-2, 1, 0, \dots, 0, 1]^\top$ is the first column of $L$.
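The FFT-based smoothing above can be sketched as follows; the helper name is ours, and the dense circulant solve is included only as a sanity check against Eq. (3).

```python
import numpy as np

def laplacian_smoothing(v, sigma):
    """Compute A_sigma^{-1} v via FFT. A_sigma = I - sigma * L is circulant,
    so its eigenvalues are 1 - sigma * fft(d), with d the first column of L."""
    d = np.zeros_like(v)
    d[0], d[1], d[-1] = -2.0, 1.0, 1.0
    return np.real(np.fft.ifft(np.fft.fft(v) / (1.0 - sigma * np.fft.fft(d))))

# Sanity check against a dense solve with the circulant matrix of Eq. (3).
rng = np.random.default_rng(0)
dim, sigma = 8, 1.0
A = (1.0 + 2.0 * sigma) * np.eye(dim)
for i in range(dim):
    A[i, (i + 1) % dim] = -sigma   # super-diagonal (wrapping at the corner)
    A[i, (i - 1) % dim] = -sigma   # sub-diagonal (wrapping at the corner)
v = rng.standard_normal(dim)
assert np.allclose(laplacian_smoothing(v, sigma), np.linalg.solve(A, v))
```

The FFT route costs $O(d\log d)$ per iteration, which is why the overhead relative to SGD is negligible.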

### 2.2 DP-LSSGD

We propose the following DP-LSSGD algorithm to resolve the finite-sum optimization in Eq. (1)

$$w^{k+1} = w^k - \eta A_\sigma^{-1}\big(\nabla f_{i_k}(w^k) + \mathbf{n}\big), \tag{4}$$

where $\nabla f_{i_k}(w^k)$ denotes the stochastic gradient of the loss evaluated on a random sample from the database and $\mathbf{n} \sim \mathcal{N}(0, \nu^2 I_{d\times d})$ is the injected Gaussian noise. In this scheme, we first add the noise $\mathbf{n}$ to the stochastic gradient vector $\nabla f_{i_k}(w^k)$, and then apply the operator $A_\sigma^{-1}$ to smooth the noisy stochastic gradient on-the-fly. We assume that each component function $f_i$ in Eq. (1) is $G$-Lipschitz. The DP-LSSGD algorithm for finite-sum optimization is summarized in Algorithm 1.
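A minimal sketch of one possible DP-LSSGD loop on a toy least-squares problem, assuming per-sample gradient clipping to enforce the Lipschitz bound; all names and hyperparameters here are illustrative, not Algorithm 1 verbatim.

```python
import numpy as np

def laplacian_smoothing(v, sigma):
    """A_sigma^{-1} v via FFT (see Section 2.1)."""
    d = np.zeros_like(v)
    d[0], d[1], d[-1] = -2.0, 1.0, 1.0
    return np.real(np.fft.ifft(np.fft.fft(v) / (1.0 - sigma * np.fft.fft(d))))

def dp_lssgd(grad_fn, w0, n_samples, T, eta, clip, nu, sigma, rng):
    """Sketch of the DP-LSSGD iteration of Eq. (4): clip a per-sample
    stochastic gradient (to enforce G-Lipschitzness), inject Gaussian
    noise N(0, nu^2 I), then smooth the noisy gradient with A_sigma^{-1}."""
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n_samples)
        g = grad_fn(w, i)
        g = g / max(1.0, np.linalg.norm(g) / clip)   # gradient clipping [1]
        g = g + rng.normal(0.0, nu, size=g.shape)    # Gaussian mechanism
        w = w - eta * laplacian_smoothing(g, sigma)  # smoothed noisy step
    return w

# Toy problem: least squares, f_i(w) = 0.5 * (x_i @ w - y_i)^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10)
w = dp_lssgd(lambda w, i: (X[i] @ w - y[i]) * X[i], np.zeros(10),
             n_samples=200, T=2000, eta=0.05, clip=1.0, nu=0.01, sigma=1.0, rng=rng)
```

Setting `sigma=0.0` recovers plain DP-SGD, so the two methods can be compared under identical noise.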

## 3 Main Theory

In this section, we present the privacy and utility guarantees for DP-LSSGD. The technical proofs are provided in the appendix.

###### Definition 1 ((ϵ, δ)-DP).

([11]) A randomized mechanism $\mathcal{M}: \mathcal{S}^n \to \mathcal{R}$ satisfies $(\epsilon, \delta)$-differential privacy if for any two adjacent data sets $S, S' \in \mathcal{S}^n$ differing by one element, and any output subset $O \subseteq \mathcal{R}$, it holds that $\mathbb{P}[\mathcal{M}(S) \in O] \leq e^{\epsilon}\,\mathbb{P}[\mathcal{M}(S') \in O] + \delta$.

###### Theorem 1 (Privacy Guarantee).

Suppose that each component function $f_i$ is $G$-Lipschitz. Given the total number of iterations $T$, for any $\delta > 0$ and privacy budget $\epsilon^2 \leq 20TG^2\log(1/\delta)/n^2$, DP-LSSGD, with injected Gaussian noise $\mathcal{N}(0, \nu^2)$ for each coordinate, satisfies $(\epsilon, \delta)$-differential privacy with $\nu^2 = 2T\alpha(\alpha-1)G^2/(n^2\log(1/\delta))$, where $\alpha = 2\log(1/\delta)/\epsilon + 1$.

###### Remark 1.

It is straightforward to show that the noise in Theorem 1 is in fact also tight to guarantee $(\epsilon, \delta)$-differential privacy for DP-SGD, since the same amount of Gaussian noise guarantees the same differential privacy for both DP-SGD and DP-LSSGD.

For convex ERM, DP-LSSGD guarantees the following utility bound in terms of the gap between the ergodic average $\tilde{w}$ of the points along the DP-LSSGD path and the optimal solution $w^*$.

###### Theorem 2 (Utility Guarantee for convex optimization).

Suppose $F$ is convex and each component function $f_i$ is $G$-Lipschitz. Given any $\epsilon, \delta > 0$, if we choose the step size $\eta_k$ and the number of iterations $T$ appropriately, where $D_\sigma := \|w^0 - w^*\|^2_{A_\sigma}$ and $w^*$ is the global minimizer of $F$, the DP-LSSGD output $\tilde{w}$ satisfies the following utility

$$\mathbb{E}\big(F(\tilde{w}) - F(w^*)\big) = \tilde{\mathcal{O}}\!\left(\frac{G\sqrt{\gamma\, d\, D_\sigma \log(1/\delta)}}{n\epsilon}\right),$$

where $\gamma \in (0, 1)$ is the constant given in Proposition 1.

###### Proposition 1.

In Theorem 2, $\gamma = \frac{1}{d}\sum_{i=1}^{d} \frac{1}{1+2\sigma-2\sigma\cos(2\pi i/d)} < 1$ for $\sigma > 0$, where $d$ is the dimension of $w$.
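Taking $\gamma$ to be the average of the eigenvalues of $A_\sigma^{-1}$ appearing in Lemma 4, it is easy to evaluate numerically; the function below is an illustrative sketch under that assumption.

```python
import numpy as np

def gamma_constant(sigma, d):
    """Average eigenvalue of A_sigma^{-1} (cf. Lemma 4): equals 1 at
    sigma = 0 and is strictly less than 1 for sigma > 0.
    For large d this Riemann sum approaches 1 / sqrt(1 + 4 * sigma)."""
    i = np.arange(d)
    return float(np.mean(
        1.0 / (1.0 + 2.0 * sigma - 2.0 * sigma * np.cos(2.0 * np.pi * i / d))))

# gamma decreases monotonically as the smoothing parameter sigma grows.
g0, g_half, g1 = (gamma_constant(s, 10000) for s in (0.0, 0.5, 1.0))
```

So a modest $\sigma$ already yields a constant noticeably below one, which is the source of the utility improvement in Theorem 2.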

###### Remark 2.

Compared with the extra utility bound of DP-SGD, $\tilde{\mathcal{O}}\big(G\sqrt{d\, D_0\log(1/\delta)}/(n\epsilon)\big)$ with $D_0 := \|w^0 - w^*\|_2^2$, DP-LSSGD has a strictly better extra utility bound by a factor of $\sqrt{\gamma}$, except for the intrinsic term $D_\sigma$. In practice, for both logistic regression and SVM, $D_\sigma$ is dominated by the other terms, and DP-LSSGD improves the utility of both models.

For nonconvex ERM, DP-LSSGD has the following utility bound measured in gradient norm.

###### Theorem 3 (Utility Guarantee for nonconvex optimization).

Suppose that $F$ is nonconvex and each component function $f_i$ is $G$-Lipschitz and has $L$-Lipschitz continuous gradient. Given any $\epsilon, \delta > 0$, if we choose the step size $\eta \leq 1/L$ and the number of iterations $T$ appropriately, where $D_F := F(w^0) - F(w^*)$ with $w^*$ being the global minimum of $F$, then the DP-LSSGD output $\tilde{w}$ satisfies the following utility

$$\mathbb{E}\,\big\|\nabla F(\tilde{w})\big\|^2_{A_\sigma^{-1}} \leq \frac{4G\sqrt{6\,\beta\, d\, L\,(2D_F + LG^2)\log(1/\delta)}}{n\epsilon},$$

where $\tilde{w}$ is sampled uniformly from the iterates $\{w^0, \dots, w^{T-1}\}$.

###### Proposition 2.

In Theorem 3, $\beta = \frac{1}{d}\sum_{i=1}^{d} \frac{1}{(1+2\sigma-2\sigma\cos(2\pi i/d))^2}$, where $d$ is the dimension of $w$.

The number $\beta$ is also strictly between $0$ and $1$, and $\beta < \gamma$. It is worth noting that if we use the $\ell_2$-norm instead of the induced norm, we have the following utility guarantee

$$\mathbb{E}\big\|\nabla F(\tilde{w})\big\|_2^2 \leq \frac{\mathbb{E}\big\|\nabla F(\tilde{w})\big\|^2_{A_\sigma^{-1}}}{\lambda_{\min}\big(A_\sigma^{-1}\big)} \leq (1+4\sigma)\,\mathbb{E}\big\|\nabla F(\tilde{w})\big\|^2_{A_\sigma^{-1}} \leq \frac{4\zeta G\sqrt{6\, d\, L\,(2D_F + LG^2)\log(1/\delta)}}{n\epsilon},$$

where $\zeta := (1+4\sigma)\sqrt{\beta}$. In the $\ell_2$-norm, DP-LSSGD has a bigger utility upper bound than DP-SGD (set $\sigma = 0$ in $\zeta$). However, this does not mean that DP-LSSGD has worse performance. To see this point, let us consider the following simple nonconvex function

$$f(x,y) = \begin{cases} \dfrac{x^2}{4} + y^2, & \text{for } \dfrac{x^2}{4} + y^2 \leq 1,\\[2mm] \sin\!\Big(\dfrac{\pi}{2}\Big(\dfrac{x^2}{4} + y^2\Big)\Big), & \text{for } \dfrac{x^2}{4} + y^2 > 1. \end{cases} \tag{5}$$

Comparing a point on the long axis of the basin with a point on the short axis, the latter can be closer to the local minima while its gradient has the larger $\ell_2$-norm. This example shows that the gradient $\ell_2$-norm is not the optimal measure for comparing utility bounds in nonconvex optimization. We will further verify this in Section 4.
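A concrete pair of points in the inner (quadratic) region of Eq. (5) illustrates the phenomenon; the particular points below are our choice for illustration, not the ones from the original text.

```python
import numpy as np

# Inside the basin x^2/4 + y^2 <= 1 of Eq. (5), f(x, y) = x^2/4 + y^2,
# with minimizer (0, 0) and gradient (x/2, 2y).
grad = lambda x, y: np.array([x / 2.0, 2.0 * y])

a = (1.8, 0.0)   # long axis:  distance 1.8  to the minimizer
b = (0.0, 0.95)  # short axis: distance 0.95 to the minimizer
dist_a, dist_b = np.hypot(*a), np.hypot(*b)
gnorm_a, gnorm_b = np.linalg.norm(grad(*a)), np.linalg.norm(grad(*b))
# b is closer to the minimizer, yet its gradient l2-norm (1.9) exceeds a's (0.9).
```

The anisotropy of the basin (curvature 1/2 along $x$ versus 2 along $y$) is what decouples distance-to-minimizer from gradient norm.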

## 4 Numerical Results

In this section, we verify the efficiency of the proposed DP-LSSGD in training multi-class logistic regression, SVM, and ResNet20. We perform ResNet20 experiments on the CIFAR10 dataset with standard data augmentation [18], and logistic regression and SVM experiments on the benchmark MNIST classification task. Based on the range of gradient values of each model, we use the clipping formula of [1] to clip the gradient $\ell_2$-norms of logistic regression, SVM, and ResNet20 to model-specific thresholds. These gradient clippings guarantee the Lipschitz condition for the objective functions. For all experiments, we train logistic regression, SVM, and ResNet20 with $(\epsilon, \delta)$-DP guarantees, and we regard DP-SGD as the benchmark.

### 4.1 Multi-class Logistic Regression and SVM

For MNIST classification, we ran 50 epochs of DP-LSSGD with a decaying learning rate schedule (the rate decays with the iteration index $k$) to train $\ell_2$-regularized multi-class logistic regression and SVM (the objective functions of both models are strongly convex). We split the training data into 50K/10K for cross-validation. The models with the best validation accuracy are used for testing. The same mini-batch size is used for all runs.

First, we show that DP-LSSGD converges faster than DP-SGD and makes the training and validation loss much smaller. We plot the evolution of training and validation loss over iterations for logistic regression (Fig. 2) and SVM (Fig. 3) with a DP guarantee. Figures 2 and 3 show that the training loss curve of DP-SGD is much higher and more oscillatory (due to the log scale of the $y$-axis) than that of DP-LSSGD. The validation loss of both logistic regression and SVM decreases as the iteration proceeds under both DP-SGD and DP-LSSGD, but the validation loss of the model trained by DP-LSSGD decays faster and reaches a much smaller value. For both training and validation, DP-LSSGD with a larger smoothing parameter $\sigma$ gives better results.

Second, consider the validation accuracy of the models trained by DP-SGD and DP-LSSGD. Figure 4 depicts the evolution of the validation accuracy of logistic regression and SVM trained by DP-SGD and DP-LSSGD. We plot the validation accuracy after every training epoch. It shows that DP-LSSGD is almost always better than DP-SGD in the sense that DP-LSSGD gives better validation accuracy. Different values of $\sigma$ in DP-LSSGD give different levels of improvement; for these experiments, a larger $\sigma$ is usually better than a smaller one.

Third, consider the testing accuracy of logistic regression and SVM trained in different scenarios. The corresponding testing accuracies are listed in Tables 2 and 3. All the numbers reported in these tables and the tables below are averaged over three independent experiments. These results reveal that the multi-class logistic regression model is remarkably more accurate than SVM for various levels of DP guarantee. Both logistic regression and SVM trained by DP-LSSGD are more accurate than those trained by DP-SGD over different levels of DP guarantee.

#### 4.1.1 The Choice of σ

Table 5 lists the testing accuracy (averaged over three runs) of both private logistic regression and SVM trained by DP-LSSGD with different $\sigma$. It shows that the accuracy improvement is stable with respect to $\sigma$: as $\sigma$ increases, the testing accuracy first increases and then decays. In practice, DP-LSSGD is as fast as DP-SGD, so for a given objective function we can try a few different $\sigma$ to find the optimal one.

### 4.2 Deep Learning

We run 100 epochs of DP-LSSGD with batch size 128 to train ResNet20 on CIFAR10. To align with our theoretical results, we apply DP-LSSGD without momentum, and no weight decay is used during training. It is known that Nesterov momentum and weight decay, i.e., $\ell_2$ regularization, help accelerate convergence and improve the generalization of the trained model; in future work, we will integrate these techniques into DP-LSSGD. We split the training data into 45K/5K for cross-validation. During training, we decay the learning rate by a factor of 10 at the 40th and 80th epochs, respectively. Figure 5 shows the training (Fig. 5(a)) and validation (Fig. 5(b)) losses of ResNet20 trained by DP-LSSGD with different Laplacian smoothing parameters $\sigma$ under a DP guarantee, plotted against the epoch. We conclude from these two plots that: (i) learning rate decay, well known to help SGD, is still very helpful for DP-LSSGD in training DNNs, as there is a sharp training and validation loss decay at the 40th epoch; (ii) Laplacian smoothing can reduce both the training and validation losses significantly.

We plot the validation accuracy against the epoch in Fig. 5(c), which is generally consistent with the evolution of the validation loss. In Fig. 5(d) we plot the testing accuracy of models trained by DP-LSSGD with different Laplacian smoothing parameters $\sigma$ and different privacy budgets $\epsilon$ with fixed $\delta$; the corresponding values of the testing accuracy are listed in Table 4. DP-LSSGD can improve the testing accuracy substantially when strong DP is guaranteed. The accuracy improvement is much more significant than in the convex optimization scenario.

DP-LSSGD is a complement to the privacy mechanisms proposed in [1] and [29]. In future work, we will integrate DP-LSSGD into the algorithms proposed in [1] and [29] to further boost private model’s utility.

#### 4.2.1 Is Gradient Norm the Right Metric for Measuring Utility?

In Section 3 we gave a simple nonconvex function and showed that a point having a smaller gradient $\ell_2$-norm need not be closer to the local minima. Now, we show experimentally that for ResNet20, a smaller gradient norm does not indicate more proximity to the local minima. Figure 6 depicts, as a function of the epoch $k$, the gradient norm (a), validation accuracy (b), and training (c) and validation (d) losses. These plots show that during training, though DP-LSSGD has a larger gradient norm than DP-SGD, it has much better utility in terms of validation accuracy and training and validation losses.

## 5 Conclusions

In this paper, we proposed a new differentially private stochastic optimization algorithm, DP-LSSGD, inspired by the recently proposed LSSGD. The algorithm is simple to implement, and the extra computational cost compared with DP-SGD is almost negligible. We showed that DP-LSSGD can lift the utility of the trained private ML models both numerically and theoretically. It is straightforward to combine LS with other variance reduction techniques, e.g., SVRG [22].

## Appendix A Proof of the Main Theorems

### A.1 Privacy Guarantee

To prove the privacy guarantee in Theorem 1, we first introduce the following -sensitivity.

###### Definition 2 (ℓ2-Sensitivity).

For any given function $f$, the $\ell_2$-sensitivity of $f$ is defined by

$$\Delta(f) = \max_{\|S - S'\|_1 = 1}\big\|f(S) - f(S')\big\|_2,$$

where $\|S - S'\|_1 = 1$ means the data sets $S$ and $S'$ differ in only one entry.
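For instance, for the mean query over records with $\ell_2$-norm bounded by $G$, replacing one record changes the output by at most $2G/n$; the small check below (variable names ours) illustrates this.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, G = 100, 5, 1.0
S = rng.standard_normal((n, dim))
S /= np.maximum(1.0, np.linalg.norm(S, axis=1, keepdims=True) / G)  # rows: l2-norm <= G
S_prime = S.copy()
S_prime[0] = -S_prime[0]  # adjacent dataset: exactly one record replaced

# ||mean(S) - mean(S_prime)||_2 = ||S[0] - S_prime[0]||_2 / n <= 2 * G / n.
sensitivity = np.linalg.norm(S.mean(axis=0) - S_prime.mean(axis=0))
```

This $2G/n$ bound is exactly the sensitivity estimate used for the query $q_k$ in the proof of Theorem 1.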

We will adapt the concepts and techniques of Rényi differential privacy (RDP) to prove the DP-guarantee of the proposed DP-LSSGD.

###### Definition 3 (RDP).

For $\alpha > 1$, a randomized mechanism $\mathcal{M}$ satisfies $(\alpha, \rho)$-Rényi differential privacy, i.e., $(\alpha, \rho)$-RDP, if for all adjacent datasets $S, S'$ differing by one element, we have

$$D_\alpha\big(\mathcal{M}(S)\,\big\|\,\mathcal{M}(S')\big) := \frac{1}{\alpha - 1}\log \mathbb{E}\left[\left(\frac{\mathcal{M}(S)}{\mathcal{M}(S')}\right)^{\alpha}\right] \leq \rho,$$

where the expectation is taken over $\mathcal{M}(S')$.

###### Lemma 1.

[40] Given a function $q$, the Gaussian mechanism $\mathcal{M} = q + \mathbf{n}$, where $\mathbf{n} \sim \mathcal{N}(0, \nu^2 I)$, satisfies $\big(\alpha, \alpha\Delta^2(q)/(2\nu^2)\big)$-RDP. In addition, if we apply the mechanism $\mathcal{M}$ to a subset of samples using uniform sampling without replacement, the subsampled mechanism satisfies $\big(\alpha, O(\tau^2\alpha\Delta^2(q)/\nu^2)\big)$-RDP when $\nu^2/\Delta^2(q)$ is sufficiently large, with $\tau$ denoting the subsample rate.

###### Lemma 2.

[26] If randomized mechanisms $\mathcal{M}_i$, for $i = 1, \dots, k$, satisfy $(\alpha, \rho_i)$-RDP, then their composition $(\mathcal{M}_1, \dots, \mathcal{M}_k)$ satisfies $\big(\alpha, \sum_{i=1}^{k}\rho_i\big)$-RDP. Moreover, the input of the $i$-th mechanism can be based on the outputs of the previous $(i-1)$ mechanisms.

###### Lemma 3.

If a randomized mechanism $\mathcal{M}$ satisfies $(\alpha, \rho)$-RDP, then $\mathcal{M}$ satisfies $\big(\rho + \log(1/\delta)/(\alpha - 1), \delta\big)$-DP for all $\delta \in (0, 1)$.

With the definition (Def. 3) and guarantees of RDP (Lemmas 1 and 2), and the connection between RDP and $(\epsilon, \delta)$-DP (Lemma 3), we can prove the following DP guarantee for DP-LSSGD.

###### Proof of Theorem 1.

Let us denote the updates of DP-SGD and DP-LSSGD at the $k$-th iteration, starting from any given points $w^k$ and $\tilde{w}^k$, respectively, as

$$w^{k+1} = w^k - \eta_k\big(\nabla f_{i_k}(w^k) + \mathbf{n}\big), \tag{6}$$

and

$$\tilde{w}^{k+1} = \tilde{w}^k - \eta_k A_\sigma^{-1}\big(\nabla f_{i_k}(\tilde{w}^k) + \mathbf{n}\big), \tag{7}$$

where the indices $i_k$ are drawn uniformly from $[n]$.

We will show that with the aforementioned Gaussian noise $\mathcal{N}(0, \nu^2)$ for each coordinate of $\mathbf{n}$, the output of DP-SGD, $w^T$, after $T$ iterations is $(\epsilon, \delta)$-DP. Let us consider the mechanism $\mathcal{M}_k$ with the query $q_k = \nabla f_{i_k}(w^k)$. We have the $\ell_2$-sensitivity of $q_k$ as $\Delta(q_k) = 2G/n$. According to Lemma 1, if we add noise with variance

$$\nu^2 = \frac{T\alpha(\alpha-1)\,\Delta^2(q_k)}{2\log(1/\delta)} = \frac{2T\alpha(\alpha-1)G^2}{n^2\log(1/\delta)},$$

the mechanism $\mathcal{M}_k$ will satisfy $\big(\alpha, \log(1/\delta)/(T(\alpha-1))\big)$-RDP. By the post-processing theorem, we immediately have that under the same noise, $A_\sigma^{-1}\mathcal{M}_k$ also satisfies $\big(\alpha, \log(1/\delta)/(T(\alpha-1))\big)$-RDP, provided that $\nu^2/\Delta^2(q_k)$ satisfies the condition of Lemma 1. Let $\alpha = 2\log(1/\delta)/\epsilon + 1$; we obtain

$$\nu^2 = \frac{2T\alpha(\alpha-1)G^2}{n^2\log(1/\delta)} = \frac{4T\big(2\log(1/\delta)+\epsilon\big)G^2}{n^2\epsilon^2},$$

which satisfies the condition of Lemma 1 as long as

$$\epsilon^2 \leq \frac{20\,TG^2\log(1/\delta)}{n^2}.$$

Therefore, according to Lemma 2, the composition $(\mathcal{M}_0, \dots, \mathcal{M}_{T-1})$ satisfies $\big(\alpha, \log(1/\delta)/(\alpha-1)\big)$-RDP. Finally, by Lemma 3, it satisfies $\big(2\log(1/\delta)/(\alpha-1), \delta\big)$-DP, i.e., $(\epsilon, \delta)$-DP. Therefore, the output of DP-SGD, $w^T$, is $(\epsilon, \delta)$-DP, and by post-processing so is the output of DP-LSSGD, $\tilde{w}^T$. ∎

###### Remark 3.

In the above proof, we used the following estimate of the sensitivity

$$\Delta(q_k) = \frac{\big\|A_\sigma^{-1}\nabla f_i(w^k) - A_\sigma^{-1}\nabla f_{i'}(w^k)\big\|_2}{n} \leq \frac{2G}{n}.$$

Indeed, let $g := \nabla f_i(w^k) - \nabla f_{i'}(w^k)$ and $d := A_\sigma^{-1}g$; then according to [27] we have

$$\|d\|_2^2 + 2\sigma\|D_+ d\|_2^2 + \sigma^2\|L d\|_2^2 = \|g\|_2^2,$$

where $D_+$ is the (periodic) forward difference matrix

$$D_+ = \begin{bmatrix}
-1 & 1 & 0 & \cdots & 0\\
0 & -1 & 1 & \cdots & 0\\
0 & 0 & -1 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & 0 & 0 & \cdots & -1
\end{bmatrix}.$$

Moreover, if we assume that $g$ is randomly sampled from a unit ball in a high-dimensional space, then a high-probability estimate of the compression ratio $\|d\|_2/\|g\|_2$ can be derived from Lemma 5.

Numerical experiments show that $\|d\|_2$ is much less than $\|g\|_2$, so the above noise can give a much stronger privacy guarantee.
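This compression is easy to observe numerically: for a standard Gaussian vector $g$, the ratio $\|A_\sigma^{-1}g\|_2/\|g\|_2$ concentrates well below one. The dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, sigma = 1000, 1.0
first_col = np.zeros(dim)
first_col[0], first_col[1], first_col[-1] = -2.0, 1.0, 1.0   # first column of L
eig = 1.0 - sigma * np.fft.fft(first_col)                    # eigenvalues of A_sigma

g = rng.standard_normal(dim)
d = np.real(np.fft.ifft(np.fft.fft(g) / eig))                # d = A_sigma^{-1} g
ratio = np.linalg.norm(d) / np.linalg.norm(g)                # well below 1 for sigma = 1
```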

### A.2 Utility Guarantee – Convex Optimization

To prove the utility guarantee for convex optimization, we first show that the Laplacian smoothing operator compresses the norm of any given Gaussian random vector with a specific ratio in expectation.

###### Lemma 4.

Let $x \sim \mathcal{N}(0, I_{d\times d})$ be a standard Gaussian random vector. Then

$$\mathbb{E}\,\|x\|^2_{A_\sigma^{-1}} = \sum_{i=1}^{d}\frac{1}{1 + 2\sigma - 2\sigma\cos(2\pi i/d)},$$

where $\|x\|^2_{A_\sigma^{-1}} := x^\top A_\sigma^{-1} x$ is the square of the induced norm of $x$ by the matrix $A_\sigma^{-1}$.

###### Proof of Lemma 4.

Let the eigenvalue decomposition of $A_\sigma^{-1}$ be $A_\sigma^{-1} = U\Lambda U^\top$, where $\Lambda$ is a diagonal matrix with $\Lambda_{ii} = 1/(1 + 2\sigma - 2\sigma\cos(2\pi i/d))$. Since $\mathbb{E}[xx^\top] = I$, we have

$$\mathbb{E}\,\|x\|^2_{A_\sigma^{-1}} = \mathbb{E}\big[\mathrm{Tr}\big(x^\top U\Lambda U^\top x\big)\big] = \sum_{i=1}^{d}\Lambda_{ii} = \sum_{i=1}^{d}\frac{1}{1 + 2\sigma - 2\sigma\cos(2\pi i/d)}. \qquad\square$$

###### Proof of Theorem 2.

Recall the update rule $w^{k+1} = w^k - \eta_k A_\sigma^{-1}(\nabla f_{i_k}(w^k) + \mathbf{n})$, where $i_k$ is drawn uniformly from $[n]$ and $\mathbf{n} \sim \mathcal{N}(0, \nu^2 I_{d\times d})$. Observe that

$$\begin{aligned}
\|w^{k+1} - w^*\|^2_{A_\sigma} &= \big\|w^k - \eta_k A_\sigma^{-1}\big(\nabla f_{i_k}(w^k) + \mathbf{n}\big) - w^*\big\|^2_{A_\sigma}\\
&= \|w^k - w^*\|^2_{A_\sigma} + \eta_k^2\Big(\big\|A_\sigma^{-1}\nabla f_{i_k}(w^k)\big\|^2_{A_\sigma} + \big\|A_\sigma^{-1}\mathbf{n}\big\|^2_{A_\sigma} + 2\big\langle A_\sigma^{-1}\nabla f_{i_k}(w^k), \mathbf{n}\big\rangle\Big)\\
&\quad - 2\eta_k\big\langle \nabla f_{i_k}(w^k) + \mathbf{n},\; w^k - w^*\big\rangle.
\end{aligned}$$

Taking expectation with respect to $i_k$ and $\mathbf{n}$ conditional on $w^k$, we have

$$\begin{aligned}
\mathbb{E}\,\|w^{k+1} - w^*\|^2_{A_\sigma} &= \mathbb{E}\,\|w^k - w^*\|^2_{A_\sigma} - 2\eta_k\,\mathbb{E}\big\langle \nabla F(w^k), w^k - w^*\big\rangle + \eta_k^2\,\mathbb{E}\big\|\nabla f_{i_k}(w^k)\big\|^2_{A_\sigma^{-1}} + \eta_k^2\,\mathbb{E}\,\|\mathbf{n}\|^2_{A_\sigma^{-1}}\\
&\leq \mathbb{E}\,\|w^k - w^*\|^2_{A_\sigma} - 2\eta_k\,\mathbb{E}\big(F(w^k) - F(w^*)\big) + \eta_k^2\big(G^2 + \gamma d\nu^2\big),
\end{aligned}$$

where the inequality is due to the convexity of $F$, the bound $\|\nabla f_{i_k}(w^k)\|^2_{A_\sigma^{-1}} \leq \|\nabla f_{i_k}(w^k)\|_2^2 \leq G^2$, and Lemma 4. It implies that

$$2\eta_k\,\mathbb{E}\big(F(w^k) - F(w^*)\big) \leq \Big(\mathbb{E}\,\|w^k - w^*\|^2_{A_\sigma} - \mathbb{E}\,\|w^{k+1} - w^*\|^2_{A_\sigma}\Big) + \eta_k^2\big(G^2 + \gamma d\nu^2\big).$$

Now taking the full expectation and summing up over $T$ iterations, we have

$$\sum_{k=0}^{T-1} 2\eta_k\,\mathbb{E}\big(F(w^k) - F(w^*)\big) \leq D_\sigma + \sum_{k=0}^{T-1}\eta_k^2\big(G^2 + \gamma d\nu^2\big),$$

where $D_\sigma := \|w^0 - w^*\|^2_{A_\sigma}$. Let $v_k := \eta_k/\sum_{j=0}^{T-1}\eta_j$; we have

$$\sum_{k=0}^{T-1} v_k\,\mathbb{E}\big(F(w^k) - F(w^*)\big) \leq \frac{D_\sigma + \sum_{k=0}^{T-1}\eta_k^2\big(G^2 + \gamma d\nu^2\big)}{2\sum_{k=0}^{T-1}\eta_k}.$$

According to the definition of $\tilde{w} := \sum_{k=0}^{T-1} v_k w^k$ and the convexity of $F$, we obtain

$$\begin{aligned}
\mathbb{E}\big(F(\tilde{w}) - F(w^*)\big) &\leq \frac{D_\sigma + \sum_{k=0}^{T-1}\eta_k^2\big(G^2 + \gamma d\nu^2\big)}{2\sum_{k=0}^{T-1}\eta_k}\\
&\leq \frac{D_\sigma + \sum_{k=0}^{T-1}\eta_k^2 G^2}{2\sum_{k=0}^{T-1}\eta_k} + \frac{\sum_{k=0}^{T-1}\eta_k^2}{2\sum_{k=0}^{T-1}\eta_k}\cdot\frac{24\gamma d T G^2\log(1/\delta)}{n^2\epsilon^2}.
\end{aligned}$$

Choosing the step size $\eta_k$ and the number of iterations $T$ as in the statement of Theorem 2 and substituting them into the above bound completes the proof. ∎

### A.3 Utility Guarantee – Nonconvex Optimization

To prove the utility guarantee for nonconvex optimization, we need the following lemma, which shows that the Laplacian smoothing operator compresses the $\ell_2$-norm of any given Gaussian random vector by a specific ratio in expectation.

###### Lemma 5.

Let $x \sim \mathcal{N}(0, I_{d\times d})$ be a standard Gaussian random vector. Then

$$\mathbb{E}\,\big\|A_\sigma^{-1}x\big\|_2^2 = \sum_{i=1}^{d}\frac{1}{\big(1 + 2\sigma - 2\sigma\cos(2\pi i/d)\big)^2}.$$
###### Proof of Lemma 5.

Let the eigenvalue decomposition of $A_\sigma^{-1}$ be $A_\sigma^{-1} = U\Lambda U^\top$, where $\Lambda$ is a diagonal matrix with $\Lambda_{ii} = 1/(1 + 2\sigma - 2\sigma\cos(2\pi i/d))$. Since $\mathbb{E}[xx^\top] = I$ and $U^\top U = I$, we have

$$\mathbb{E}\,\big\|A_\sigma^{-1}x\big\|_2^2 = \mathbb{E}\big[\mathrm{Tr}\big(x^\top U\Lambda U^\top U\Lambda U^\top x\big)\big] = \mathbb{E}\big[\mathrm{Tr}\big(x^\top U\Lambda^2 U^\top x\big)\big] = \sum_{i=1}^{d}\Lambda_{ii}^2 = \sum_{i=1}^{d}\frac{1}{\big(1 + 2\sigma - 2\sigma\cos(2\pi i/d)\big)^2}. \qquad\square$$

###### Proof of Theorem 3.

Recall the update rule $w^{k+1} = w^k - \eta_k A_\sigma^{-1}(\nabla f_{i_k}(w^k) + \mathbf{n})$, where $i_k$ is drawn uniformly from $[n]$ and $\mathbf{n} \sim \mathcal{N}(0, \nu^2 I_{d\times d})$. Since $F$ is $L$-smooth, we have

$$\begin{aligned}
F(w^{k+1}) &\leq F(w^k) + \big\langle \nabla F(w^k), w^{k+1} - w^k\big\rangle + \frac{L}{2}\|w^{k+1} - w^k\|_2^2\\
&= F(w^k) - \eta_k\big\langle \nabla F(w^k), A_\sigma^{-1}\big(\nabla f_{i_k}(w^k) + \mathbf{n}\big)\big\rangle\\
&\quad + \frac{\eta_k^2 L}{2}\Big(\big\|A_\sigma^{-1}\nabla f_{i_k}(w^k)\big\|_2^2 + \big\|A_\sigma^{-1}\mathbf{n}\big\|_2^2 + 2\big\langle A_\sigma^{-1}\nabla f_{i_k}(w^k), A_\sigma^{-1}\mathbf{n}\big\rangle\Big).
\end{aligned}$$

Taking expectation with respect to $i_k$ and $\mathbf{n}$ conditional on $w^k$, we have

$$\begin{aligned}
\mathbb{E}\,F(w^{k+1}) &\leq \mathbb{E}\,F(w^k) - \eta_k\Big(1 - \frac{\eta_k L}{2}\Big)\mathbb{E}\,\big\|\nabla F(w^k)\big\|^2_{A_\sigma^{-1}} + \frac{\eta_k^2 L}{2}\big(G^2 + d\beta\nu^2\big)\\
&\leq \mathbb{E}\,F(w^k) - \frac{\eta_k}{2}\,\mathbb{E}\,\big\|\nabla F(w^k)\big\|^2_{A_\sigma^{-1}} + \frac{\eta_k^2 L\big(G^2 + d\beta\nu^2\big)}{2},
\end{aligned}$$

where the first inequality uses Lemma 5 and the bound $\|A_\sigma^{-1}\nabla f_{i_k}(w^k)\|_2^2 \leq G^2$, and the last inequality is due to $\eta_k \leq 1/L$. Now taking the full expectation and summing up over $T$ iterations, we have

$$\mathbb{E}\,F(w^T) \leq F(w^0) - \sum_{k=0}^{T-1}\frac{\eta_k}{2}\,\mathbb{E}\,\big\|\nabla F(w^k)\big\|^2_{A_\sigma^{-1}} + \sum_{k=0}^{T-1}\frac{\eta_k^2 L\big(G^2 + d\beta\nu^2\big)}{2}.$$

If we choose a fixed step size, i.e., $\eta_k = \eta$, rearranging the above inequality, and using