Asynchronous Stochastic Gradient Descent with Delay Compensation

# Supplementary Material: Asynchronous Stochastic Gradient Descent with Delay Compensation

## Abstract

With the fast development of deep learning, it has become common to learn big neural networks using massive training data. Asynchronous Stochastic Gradient Descent (ASGD) is widely adopted for this task because of its efficiency; however, it is known to suffer from the problem of delayed gradients. That is, when a local worker adds its gradient to the global model, the global model may have been updated by other workers and this gradient becomes “delayed”. We propose a novel technique to compensate for this delay, so as to make the optimization behavior of ASGD closer to that of sequential SGD. This is achieved by leveraging Taylor expansion of the gradient function and an efficient approximation to the Hessian matrix of the loss function. We call the new algorithm Delay Compensated ASGD (DC-ASGD). We evaluated the proposed algorithm on the CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD.


## 1 Introduction

Deep Neural Networks (DNN) have pushed the frontiers of many applications, such as speech recognition Sak et al. (2014); Sercu et al. (2016), computer vision Krizhevsky et al. (2012); He et al. (2016); Szegedy et al. (2016), and natural language processing Mikolov et al. (2013); Bahdanau et al. (2014); Gehring et al. (2017). Part of the success of DNN should be attributed to the availability of big training data and powerful computational resources, which allow people to learn very deep and big DNN models in parallel Zhang et al. (2015); Chen & Huo (2016); Chen et al. (2016).

In this paper, we propose a novel method, called Delay Compensated ASGD (or DC-ASGD for short), to tackle the problem of delayed gradients. For this purpose, we study the Taylor expansion of the gradient function $g(w_{t+\tau})$ at $w_t$. We find that the delayed gradient $g(w_t)$ is just the zero-order approximator of the correct gradient $g(w_{t+\tau})$, and we can leverage more terms in the Taylor expansion to achieve a more accurate approximation of $g(w_{t+\tau})$. However, this straightforward idea is practically non-trivial, because even including the first-order derivative of the gradient requires computing the second-order derivative of the original loss function (i.e., the Hessian matrix), which introduces high computation and space complexity. To overcome this challenge, we propose a cheap yet effective approximator of the Hessian matrix, which achieves a good trade-off between bias and variance of approximation and is based only on previously available gradients (without the necessity of directly computing the Hessian matrix).

DC-ASGD is similar to ASGD in the sense that no worker needs to wait for others. It differs from ASGD in that it does not directly add the local gradient to the global model, but compensates for the delay in the local gradient by using the approximate Taylor expansion. By doing so, it maintains almost the same efficiency as ASGD and achieves much higher accuracy. Theoretically, we prove that DC-ASGD can converge at a rate of the same order as sequential SGD for non-convex neural networks, if the delay is upper bounded; and it is more tolerant of the delay than ASGD. Empirically, we conducted experiments on both the CIFAR-10 and ImageNet datasets. The results show that (1) compared to SSGD and ASGD, DC-ASGD accelerates the convergence of the training process; (2) the accuracy of the model obtained by DC-ASGD within the same time period is very close to the accuracy obtained by sequential SGD.

## 2 Problem Setting

In this section, we introduce DNN and its parallel training through ASGD.

Given a multi-class classification problem, we denote $\mathcal{X}=\mathbb{R}^d$ as the input space, $\mathcal{Y}=\{1,\dots,K\}$ as the output space, and $\mathbb{P}$ as the joint distribution over $\mathcal{X}\times\mathcal{Y}$. Here $d$ denotes the dimension of the input space, and $K$ denotes the number of categories in the output space.

We have a training set $\{(x_1,y_1),\dots,(x_S,y_S)\}$, whose elements are i.i.d. samples drawn from the input-output space according to the joint distribution. Our goal is to learn a neural network model parameterized by $w$ based on the training set. Specifically, the neural network models have hierarchical structures, in which each node conducts linear combination and non-linear activation over its connected nodes in the lower layer. The parameters are the weights on the edges between two layers. The neural network model produces an output vector for each input $x$, indicating its likelihoods of belonging to different categories. Because the underlying distribution is unknown, a common way of learning the model is to minimize the empirical loss function. A widely-used loss function for deep neural networks is the cross-entropy loss, which is defined as follows,

$$
f(x,y;w) = -\sum_{k=1}^{K}\left(I_{[y=k]}\log\sigma_k(x;w)\right). \tag{1}
$$

Here $\sigma_k(x;w)$ is the Softmax operator. The objective is to optimize the empirical risk, defined as below,

$$
F(w) = \frac{1}{S}\sum_{s=1}^{S} f_s(w) := \frac{1}{S}\sum_{s=1}^{S} f(x_s,y_s;w). \tag{2}
$$
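As a concrete reading of Eqns. (1) and (2), the following minimal sketch computes the cross-entropy loss and the empirical risk directly from raw output scores. It assumes 0-indexed class labels and takes the pre-Softmax scores (called `logits` here, a name not used in the text) as given rather than produced by a network.

```python
import math

def softmax(logits):
    """Numerically stable Softmax over a list of K scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, y):
    """Eq. (1) for one example: the indicator keeps only the term k = y."""
    return -math.log(softmax(logits)[y])

def empirical_risk(batch_logits, labels):
    """Eq. (2): average of the per-example losses over the S examples."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)
```

With uniform scores over $K$ classes, the loss of every example is $\log K$, which is a convenient sanity check.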

As mentioned in the introduction, ASGD is a widely-used approach to perform parallel training of neural networks. Although ASGD is highly efficient, it is well known to suffer from the problem of delayed gradient. To better illustrate this problem, let us have a close look at the training process of ASGD as shown in Figure 1. According to the figure, a local worker starts from $w_t$, the snapshot of the global model at time $t$, calculates the local gradient $g(w_t)$, and then adds this gradient back to the global model. However, before this happens, some other workers may have already added their local gradients to the global model, so that the global model has been updated $\tau$ times and has become $w_{t+\tau}$. The ASGD algorithm is blind to this situation, and simply adds the gradient $g(w_t)$ to the global model $w_{t+\tau}$, as follows.

$$
w_{t+\tau+1} = w_{t+\tau} - \eta\, g(w_t), \tag{3}
$$

where $\eta$ is the learning rate.
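The effect of Eqn. (3) can be seen on a one-dimensional toy problem. The sketch below assumes the hypothetical loss $F(w)=\tfrac{1}{2}w^2$ (so the gradient is simply $w$), a learning rate of 0.5, and a delay of three updates; none of these values come from the experiments.

```python
def grad(w):
    """Gradient of the toy loss F(w) = 0.5 * w**2 (a stand-in for a DNN loss)."""
    return w

def asgd_step(w_global, w_snapshot, lr):
    """Eq. (3): apply the gradient taken at the stale snapshot to the current model."""
    return w_global - lr * grad(w_snapshot)

def sgd_step(w_global, lr):
    """Sequential SGD: the gradient is evaluated at the model being updated."""
    return w_global - lr * grad(w_global)

w_t = 1.0                      # snapshot pulled by a slow worker at time t
w_now = w_t
for _ in range(3):             # tau = 3 updates by other workers in the meantime
    w_now = sgd_step(w_now, lr=0.5)

w_asgd = asgd_step(w_now, w_t, lr=0.5)   # uses the delayed gradient g(w_t)
w_seq = sgd_step(w_now, lr=0.5)          # uses the correct gradient g(w_{t+tau})
```

In this toy run the delayed gradient is eight times larger than the correct one, so the ASGD step overshoots the optimum at zero while the sequential step moves toward it.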

It is clear that the above update rule of ASGD is problematic (and not equivalent to that of sequential SGD): one actually adds a “delayed” gradient $g(w_t)$ to the current global model $w_{t+\tau}$. In contrast, the correct way is to update the global model $w_{t+\tau}$ based on the gradient w.r.t. $w_{t+\tau}$. This problem of delayed gradient has been well known Agarwal & Duchi (2011); Recht et al. (2011); Lian et al. (2015); Avron et al. (2015), and many practical observations indicate that it usually costs ASGD more iterations to converge than sequential SGD, and sometimes the converged model of ASGD cannot reach accuracy parity with sequential SGD, especially when the number of workers is large Dean et al. (2012); Ho et al. (2013); Zhang et al. (2015). Researchers have tried to improve ASGD from different perspectives Ho et al. (2013); McMahan & Streeter (2014); Zhang et al. (2015); Sra et al. (2015); Mitliagkas et al. (2016). However, to the best of our knowledge, there is still no solution that can compensate for the delayed gradient while keeping the high efficiency of ASGD. This is exactly the motivation of our paper.

## 3 Delay Compensation using Taylor Expansion and Hessian Approximation

As explained in the previous sections, ideally, the optimization algorithm should add the gradient $g(w_{t+\tau})$ to the global model $w_{t+\tau}$; however, ASGD adds a delayed version $g(w_t)$. In this section, we propose a novel method to bridge this gap by using Taylor expansion and Hessian approximation.

### 3.1 Gradient Decomposition using Taylor Expansion

The Taylor expansion of the gradient function $g(w_{t+\tau})$ at $w_t$ can be written as follows Folland (2005),

$$
g(w_{t+\tau}) = g(w_t) + \nabla g(w_t)(w_{t+\tau}-w_t) + O\!\left((w_{t+\tau}-w_t)^2\right) I_n, \tag{4}
$$

where denotes the matrix with the element for and , with and and is a -dimension vector with all the elements equal to .

By comparing the above formula with Eqn. (3), we can immediately find that ASGD actually uses the zero-order term in the Taylor expansion as its approximation to $g(w_{t+\tau})$, and totally ignores all the higher-order terms. This is exactly the root cause of the problem of delayed gradient. With this insight, a straightforward and ideal method is to use the full Taylor expansion to compensate for the delay. However, this is practically intractable, since it involves the sum of an infinite number of terms. And even the simplest delay compensation, i.e., additionally keeping the first-order term in the Taylor expansion (which is shown below), is highly non-trivial,

$$
g(w_{t+\tau}) \approx g(w_t) + \nabla g(w_t)(w_{t+\tau}-w_t). \tag{5}
$$
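To illustrate what the first-order term in Eqn. (5) buys, the sketch below uses a hypothetical two-parameter quadratic loss $\tfrac{1}{2}w^\top A w - b^\top w$, for which the Hessian is the constant matrix $A$ and all higher-order terms vanish. The matrix `A`, vector `b`, and the two model states are illustrative assumptions; for a real network the compensation would only be approximate.

```python
A = [[2.0, 0.5], [0.5, 1.0]]   # Hessian of the toy quadratic loss
b = [1.0, -1.0]

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def grad(w):
    """g(w) = A w - b for the loss 0.5 * w^T A w - b^T w."""
    return [aw - bi for aw, bi in zip(matvec(A, w), b)]

def compensated_grad(w_old, w_new):
    """Eq. (5): g(w_old) + Hessian * (w_new - w_old), using the exact Hessian A."""
    delta = [n - o for n, o in zip(w_new, w_old)]
    return [g + h for g, h in zip(grad(w_old), matvec(A, delta))]

w_old, w_new = [1.0, 2.0], [0.5, 1.5]     # stand-ins for w_t and w_{t+tau}
zero_order_error = max(
    abs(a - c) for a, c in zip(grad(w_new), grad(w_old)))
first_order_error = max(
    abs(a - c) for a, c in zip(grad(w_new), compensated_grad(w_old, w_new)))
```

Here the delayed gradient alone is off by 1.25 in its worst coordinate, while the first-order compensated gradient matches the true gradient exactly because the loss is quadratic.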

This is because the first-order derivative of the gradient function corresponds to the Hessian matrix of the original loss function (e.g., cross entropy for neural networks), which is defined as $H(w) = [h_{ij}]$ where $h_{ij} = \frac{\partial^2 f}{\partial w_i \partial w_j}(w)$.

For a neural network model with millions of parameters (which is very common and may only be regarded as a medium-size network today), the corresponding Hessian matrix will contain trillions of elements. It is clearly very expensive, in both computation and storage, to obtain such a large matrix. Fortunately, as shown in the next subsection, we find an easy-to-compute/store approximator to the Hessian matrix, which makes our proposal of delay compensation technically feasible.
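The storage argument is simple arithmetic, sketched below under the assumption of a model with exactly one million parameters stored as 4-byte floats (both figures are illustrative, not taken from the experiments).

```python
def hessian_entries(num_params):
    """A dense Hessian has one entry per ordered pair of parameters."""
    return num_params * num_params

def dense_hessian_bytes(num_params, bytes_per_entry=4):
    """Memory for the full matrix, assuming 4-byte floats by default."""
    return hessian_entries(num_params) * bytes_per_entry

n = 1_000_000                                  # hypothetical model size
entries = hessian_entries(n)                   # one trillion entries
terabytes = dense_hessian_bytes(n) / 1e12      # size of the dense matrix in TB
diagonal_megabytes = n * 4 / 1e6               # a diagonal approximation instead
```

A dense Hessian for this model would need about 4 TB, whereas a diagonal approximation needs the same 4 MB as the model itself.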

### 3.2 Approximation of Hessian Matrix

Computing the exact Hessian matrix is computationally and spatially expensive, especially for large models. Alternatively, we want to find some approximators that are theoretically close to the Hessian matrix, but can be easily stored and computed without introducing additional complexity (i.e., just using what we already have during the previous training process).

First, we show that the outer product of the gradients is an asymptotically unbiased estimation of the Hessian matrix. Let us use $G(w_t)$ to denote the outer product matrix of the gradient at $w_t$, i.e.,

$$
G(w_t) = \left(\frac{\partial}{\partial w} f(x,y,w_t)\right)\left(\frac{\partial}{\partial w} f(x,y,w_t)\right)^{T}. \tag{6}
$$

Because the cross entropy loss is a negative log-likelihood with respect to the Softmax distribution of the model, i.e., $f(x,y,w) = -\log P(Y=y\mid x,w)$, it is not difficult to show that the outer product of the gradient is an asymptotically unbiased estimation of the Hessian, according to the two equivalent methods to calculate the Fisher information matrix Friedman et al. (2001):

$$
\epsilon_t \triangleq \mathbb{E}_{(y|x,w^*)}\left\|G(w_t) - H(w_t)\right\| \to 0, \quad t\to\infty. \tag{7}
$$

The assumption behind the above equivalence is that the underlying distribution equals the model distribution with parameter $w^*$ (i.e., there is no approximation error of the NN hypothesis space) and that the training model $w_t$ gradually converges to the optimal model $w^*$ along with the training process. This assumption is reasonable considering the universal approximation property of DNN Hornik (1991) and the recent results on the optimality of the local optima of DNN Choromanska et al. (2015); Kawaguchi (2016).
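The equivalence can be checked numerically in a deliberately simplified setting. The sketch below assumes that the parameters are the Softmax inputs themselves (so the gradient of the loss for label $y$ is $p - e_y$ and the Hessian is $\mathrm{diag}(p) - pp^\top$) and that labels are drawn from the model's own distribution $p$, which mirrors the assumption stated above. The score values are arbitrary.

```python
import math

def softmax(logits):
    """Numerically stable Softmax over a list of K scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_outer_product(p):
    """E_{y ~ p}[g g^T] with g = p - e_y, the gradient of -log p_y w.r.t. the scores."""
    K = len(p)
    G = [[0.0] * K for _ in range(K)]
    for y in range(K):
        g = [p[k] - (1.0 if k == y else 0.0) for k in range(K)]
        for i in range(K):
            for j in range(K):
                G[i][j] += p[y] * g[i] * g[j]
    return G

def hessian(p):
    """Hessian of -log p_y w.r.t. the scores: diag(p) - p p^T (independent of y)."""
    K = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(K)]
            for i in range(K)]

p = softmax([0.2, -1.0, 0.7])
max_gap = max(
    abs(a - b)
    for row_g, row_h in zip(expected_outer_product(p), hessian(p))
    for a, b in zip(row_g, row_h))
```

In this setting the expected outer product and the Hessian agree up to floating-point rounding; when labels come from a different distribution than the model's, a gap of the kind measured by $\epsilon_t$ appears.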

Second, we show that by further introducing a well-designed weight to the outer product of the gradients, we can achieve a better trade-off between bias and variance for the approximation.

Although the outer product of the gradients can achieve unbiased estimation to the Hessian matrix, it may induce high approximation error due to potentially large variance. To further control the variance, we use mean square error (MSE) to measure the quality of an approximator, which is defined as follows,

$$
\mathrm{mse}^t(G) = \mathbb{E}_{(y|x,w^*)}\left\|G(w_t) - H(w_t)\right\|^2. \tag{8}
$$

We consider the following new approximator $\lambda G(w_t)$, and prove that with an appropriately set $\lambda$, $\lambda G(w_t)$ can lead to smaller MSE than $G(w_t)$, for an arbitrary model $w_t$ during the training.

###### Theorem 3.1

Assume that the loss function is -Lipschitz, and for arbitrary , , . If makes the following inequality holds,

$$
\sum_{k=1}^{K}\frac{1}{\sigma_k^3(x,w_t)} \ge 2C\left[\left(\sum_{k=1}^{K}\frac{1}{\sigma_k(x,w_t)}\right)^2 + 2L_1^2\epsilon_t\right], \tag{9}
$$

where , and the model converges to the optimal model , then .

The following corollary gives simpler sufficient conditions for Theorem 3.1.

###### Corollary 3.2

A sufficient condition for inequality (9) is such that .

According to Corollary 3.2, we have the following discussions. Please note that, if converges to , is a decreasing term and approaches . Thus, can be upper bounded by a very small constant for large . Therefore, the condition on is more likely to be satisfied when () is close to . Please note that this is not a strong condition, since if () is very small, the classification power of the corresponding neural network model will be very weak and not useful in practice.

Third, to reduce the storage of the approximator $\lambda G(w_t)$, we adopt a widely-used diagonalization trick Becker et al. (1988), which has shown promising empirical results. To be specific, we only store the diagonal elements of the approximator and set all the other elements to zero. We denote the refined approximator as $\mathrm{Diag}(\lambda G(w_t))$ and assume that the diagonalization error is upper bounded by $\epsilon_D$. We give a uniform upper bound of its MSE in the supplementary materials, from which we can see that $\lambda$ plays a role of trading off the variance and the Lipschitz constant.

## 4 Delay Compensated ASGD: Algorithm Description

In Section 3, we have shown that $\mathrm{Diag}(\lambda G(w_t))$ is a cheap approximator of the Hessian matrix, with guaranteed approximation accuracy. In this section, we will use this approximator to compensate for the gradient delay, and call the corresponding algorithm Delay-Compensated ASGD (DC-ASGD). Since $\mathrm{Diag}(\lambda G(w_t))(w_{t+\tau}-w_t) = \lambda\, g(w_t)\odot g(w_t)\odot(w_{t+\tau}-w_t)$, where $\odot$ indicates the element-wise product, the update rule for DC-ASGD can be written as follows:

$$
w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda\, g(w_t)\odot g(w_t)\odot(w_{t+\tau}-w_t)\right). \tag{10}
$$

We call $g(w_t) + \lambda\, g(w_t)\odot g(w_t)\odot(w_{t+\tau}-w_t)$ the delay-compensated gradient for ease of reference.
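Eqn. (10) is purely element-wise, so it can be sketched in a few lines. The function name `dc_asgd_update` and all numeric values below are hypothetical and chosen only so that the arithmetic is easy to follow; setting `lam` to zero recovers the plain ASGD update of Eqn. (3).

```python
def dc_asgd_update(w_current, w_backup, grad_at_backup, lr, lam):
    """Eq. (10): w <- w - lr * (g + lam * g * g * (w - w_backup)), element-wise."""
    return [
        w - lr * (g + lam * g * g * (w - wb))
        for w, wb, g in zip(w_current, w_backup, grad_at_backup)
    ]

w_backup = [1.0, -2.0]     # model the worker pulled (w_t)
w_current = [0.5, -1.0]    # global model when the gradient arrives (w_{t+tau})
g = [2.0, -4.0]            # hypothetical delayed gradient g(w_t)

asgd = dc_asgd_update(w_current, w_backup, g, lr=0.1, lam=0.0)   # plain ASGD
dc = dc_asgd_update(w_current, w_backup, g, lr=0.1, lam=0.5)     # compensated
```

In this example the compensation term shrinks the first coordinate of the gradient from 2.0 to 1.0 and flips the second from -4.0 to 4.0, because the global model has already moved a long way in that coordinate since the worker pulled it.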

The flow of DC-ASGD is shown in Algorithms 1 and 2. Here we assume that DC-ASGD is implemented by using the parameter server framework (although it can also be implemented in other frameworks). According to Algorithm 1, a local worker pulls the latest global model $w_t$ from the parameter server, computes its gradient $g(w_t)$, and sends it back to the server. According to Algorithm 2, the parameter server stores a backup model for each worker when that worker pulls $w_t$. When the delayed gradient $g(w_t)$ calculated by the worker is received at time $t+\tau$, the parameter server updates the global model according to Eqn. (10).

Please note that, compared to ASGD, DC-ASGD has no extra communication cost and no extra computational requirement on the local workers. The additional computations regarding Eqn. (10) only introduce a lightweight overhead to the parameter server. As for the space requirement, for each worker, the parameter server needs to additionally store a backup model. This is not a critical issue since the parameter server is usually implemented in a distributed manner, and the parameters and their backup versions are stored in CPU-side memory, whose capacity is usually far beyond the total parameter size. In this case, the cost of DC-ASGD is quite similar to that of ASGD, which is also reflected by our experiments.
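The server-side bookkeeping described above can be sketched as a toy, single-process class. This is only an illustration under stated assumptions: the class name `ParameterServer`, the `pull`/`push` method names, and the worker identifiers are hypothetical, the workers run sequentially rather than in parallel, and the worker-side loss is the toy quadratic $\tfrac{1}{2}\|w\|^2$.

```python
class ParameterServer:
    """Toy single-process stand-in for the server side of DC-ASGD."""

    def __init__(self, w_init, lr, lam):
        self.w = list(w_init)
        self.lr = lr
        self.lam = lam
        self.backup = {}          # worker id -> copy of the model it last pulled

    def pull(self, worker_id):
        """Hand out the current model and remember it as this worker's backup."""
        self.backup[worker_id] = list(self.w)
        return list(self.w)

    def push(self, worker_id, grad):
        """Apply Eq. (10) using the backup stored when this worker pulled."""
        w_bak = self.backup[worker_id]
        self.w = [
            w - self.lr * (g + self.lam * g * g * (w - wb))
            for w, wb, g in zip(self.w, w_bak, grad)
        ]

def toy_grad(w):
    """Worker side: gradient of the hypothetical loss 0.5 * ||w||^2."""
    return list(w)

server = ParameterServer([1.0, 1.0], lr=0.5, lam=1.0)
w_a = server.pull("a")            # worker a pulls w_t
w_b = server.pull("b")            # worker b pulls the same model
server.push("b", toy_grad(w_b))   # b finishes first: no delay, no compensation
server.push("a", toy_grad(w_a))   # a's gradient is now one update stale
```

The only extra state is one backup vector per worker, and the only extra computation is the element-wise compensation term inside `push`, which matches the cost discussion above.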

Delay compensation is applicable not only to ASGD but also to SSGD. A recent study on SSGD Goyal et al. (2017) assumes $g(w_{t+j}) \approx g(w_t)$ for $j < k$ to make the updates from small and large mini-batch SGD similar; this assumption can be immediately improved by applying the delay-compensated gradient. Please check the detailed discussion in the Supplementary.

## 5 Convergence Analysis

In this section, we prove the convergence rate of DC-ASGD. Due to space restrictions, we only give the results for the non-convex case, and leave the results for the convex case (which is much easier) to the supplementary.

In order to present our main theorem, we need to introduce the following mild assumptions.

Assumption 1 (Smoothness): Lian et al. (2015); Recht et al. (2011) The loss function is smooth w.r.t. the model parameter, and we use $L_1, L_2, L_3$ to denote the upper bounds of the first, second, and third-order derivatives of the loss function. The activation function is Lipschitz continuous.

Assumption 2 (Non-convexity): Lee et al. (2016) The loss function is strongly convex in a ball centered at each local optimum, and twice differentiable with respect to $w$.

We also introduce some notations to simplify the presentation of our results, i.e.,

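$$
\begin{aligned}
M &= \max_{k,\,w_{loc}}\left|P(Y=k|x,w_{loc}) - P(Y=k|x,w^*)\right|,\\
H &= \max_{k,\,x,\,w}\left|\frac{\partial^2 P(Y=k|x,w)}{\partial^2 w}\times\frac{1}{P(Y=k|x,w)}\right|,\quad \forall k\in[K],\ x,\ w.
\end{aligned}
$$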

Actually, the non-convexity error , which is defined as the upper bound of the difference between the prediction outputs of the local optima and the global optimum (Please see Lemma 5.1 in the supplementary materials). We assume that the DC-ASGD search in the set and denote , , where , .

With all the above, we have the following theorem.

###### Theorem 5.1

Assume that Assumptions 1-2 hold. Set the learning rate where is the mini-batch size, and is the upper bound of the variance of the delay-compensated gradient. If and delay is upper-bounded as below,

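$$
\tau \le \min\left\{\frac{L_2\gamma}{C_\lambda},\ \frac{\gamma}{C_\lambda},\ \frac{\sqrt{T\gamma}}{\tilde{C}},\ \frac{L_2 T\gamma}{4\tilde{C}}\right\}, \tag{11}
$$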

where , then DC-ASGD has the following ergodic convergence rate,

$$
\min_{t=\{1,\cdots,T\}}\mathbb{E}\left(\left\|\nabla F(w_t)\right\|^2\right) \le V\sqrt{\frac{2D_0L_2}{bT}}, \tag{12}
$$

where $T$ is the number of iterations, and the expectation is taken with respect to the random sampling in SGD and the data distribution.

Proof Sketch:

Step 1: We denote the delay-compensated gradient as where is the index of instances in the mini-batch and . According to Assumption 1, we have

 EF(wt+τ+1)−F(wt+τ) (13) ≤ +bηt+τ∥∥ ∥∥∇F(wt+τ)−b∑m=1∇Fh(wt)∥∥ ∥∥2 +bηt+τ∥∥ ∥∥b∑m=1Egdcm(wt)−b∑m=1Fh(wt)∥∥ ∥∥2 +η2t+τL22E⎛⎝∥∥ ∥∥b∑m=1gdcm(wt)∥∥ ∥∥2⎞⎠.

The term , measured by the expectation with respect to , is bounded by . The term can be bounded by , which will be smaller than when is small. Other terms which are related to the gradients can be further upper bounded by the smoothness property of the loss function.

Step 2: We proved that, under the non-convexity assumption, if , then when , , where . That is, we can find a weaker condition for the decreasing of than that for .

Step 3: By plugging in the decreasing rate of in Step 1 and following a similar proof of the convergence rate of ASGD Lian et al. (2015), we can get the result in the theorem.

Discussions:

(1) The above theorem shows that the convergence rate of DC-ASGD is in the order of . Recall that the convergence rate of ASGD is , where is the variance for the delayed gradient . By simple calculation, can be upper bounded by , where is the extra moments of the noise introduced by the delay compensation term. Thus if we set , DC-ASGD and ASGD will converge at the same rate. As the training process goes on, will become smaller. Compared with , (composed by variance of ) will not be the dominant order and can be gradually neglected. As a result, the feasible range for is actually very large.

(2) Although DC-ASGD converges at the same rate as ASGD, its tolerance of the delay is much better if and . The intuition for the condition on is that larger induces smaller step size . A small step size means that and are close to each other. According to the upper bound of the Taylor expansion series Folland (2005), we can see that the delay-compensated gradient will be more accurate than the delayed gradient used in ASGD. Since is related to the diagonalization error and the non-convexity error , smaller and will lead to looser conditions for the convergence. If these two errors are sufficiently small (which is usually the case according to Choromanska et al. (2015); Kawaguchi (2016); LeCun (1987)), the condition can be simplified as , which is easy to satisfy with a small . Assume that , which is easy to satisfy if the gradient is small (e.g., at the later stage of the training process). Accordingly, we can obtain the feasible range for as . can be regarded as a trade-off between the extra variance introduced by the delay-compensation term and the bias in the Hessian approximation.

(3) Actually ASGD is an extreme case of DC-ASGD, with $\lambda = 0$. Another extreme case is with . DC-ASGD prefers larger and smaller , which can lead to a faster speed-up and a larger tolerance for delay.

Based on the above discussions, we have the following corollary, which indicates that DC-ASGD is superior to ASGD in most cases.

###### Corollary 5.2

Let , which is a constant. If we choose and the number of total iterations , DC-ASGD will outperform ASGD by a factor of .

## 6 Experiments

In this section, we evaluate our proposed DC-ASGD algorithm. We used two datasets: CIFAR-10 Hinton (2007) and ImageNet ILSVRC 2013 Russakovsky et al. (2015). The experiments were conducted on a GPU cluster interconnected with InfiniBand. Each node has four K40 Tesla GPU processors. We treat each GPU as a separate local worker. For the DNN algorithm running on each worker, we chose ResNet He et al. (2016) since it produces state-of-the-art accuracy in many image-related tasks and its implementation is available through open-source projects. For the parallelization of ResNet across machines, we leveraged an open-source parameter server.

We implemented DC-ASGD on this experimental platform. We have two versions of the implementation: one sets $\lambda$ as a constant, and the other adaptively tunes $\lambda$ using a moving average method proposed by Tieleman & Hinton (2012). Specifically, we first define a quantity called MeanSquare as follows,

$$
\mathrm{MeanSquare}(t) = m\cdot\mathrm{MeanSquare}(t-1) + (1-m)\cdot g(w_t)^2, \tag{14}
$$

where is a constant taking value from . And then we divide the initial by , where for all our experiments. This adaptive method is adopted to reduce the variance among coordinates with historical gradient values. For ease of reference, we denote the first implementation as DC-ASGD-c (constant) and the second as DC-ASGD-a (adaptive).
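A minimal sketch of the adaptive variant follows. The moving average implements Eqn. (14) element-wise; the divisor is an assumption, taken here to be the RMSProp-style $\sqrt{\mathrm{MeanSquare}(t)+\epsilon}$ in the spirit of Tieleman & Hinton (2012). The decay `m = 0.9`, the constant `eps`, the initial value `lam0`, and the gradients are all illustrative values, not the settings used in the experiments.

```python
import math

def update_mean_square(mean_square, grad, m):
    """Eq. (14), element-wise: exponential moving average of squared gradients."""
    return [m * ms + (1.0 - m) * g * g for ms, g in zip(mean_square, grad)]

def adaptive_lambda(lam0, mean_square, eps=1e-7):
    """Assumed form: per-coordinate lam0 / sqrt(MeanSquare + eps)."""
    return [lam0 / math.sqrt(ms + eps) for ms in mean_square]

ms = [0.0, 0.0]
for g in ([1.0, 0.1], [1.0, 0.1], [1.0, 0.1]):   # hypothetical gradient history
    ms = update_mean_square(ms, g, m=0.9)

lam = adaptive_lambda(2.0, ms)
```

Under this assumed form, coordinates with historically small gradients receive a larger $\lambda$ and coordinates with large gradients a smaller one, which is one way to even out the compensation term across coordinates.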

In addition to DC-ASGD, we also implemented ASGD and SSGD, which have been used in many previous works as baselines Dean et al. (2012); Chen et al. (2016); Das et al. (2016). Furthermore, for the experiments on CIFAR-10, we used the sequential SGD algorithm as a reference model to examine the accuracy of parallel algorithms. However, for the experiments on ImageNet, we were not able to show this reference because it simply took too long for a single machine to finish the training. For the sake of fairness, all experiments started from the same randomly initialized model, and used the same strategy for learning rate scheduling. The data were repartitioned randomly onto the local workers every epoch.

### 6.1 Experimental Results on CIFAR-10

The CIFAR-10 dataset consists of a training set of 50k images and a test set of 10k images in 10 classes. We trained a 20-layer ResNet model on this dataset (without data augmentation). For all the algorithms under investigation, we performed training for 160 epochs, with a mini-batch size of 128, and an initial learning rate which was reduced by ten times after 80 and 120 epochs following the practice in He et al. (2016). We performed grid search for the hyper-parameter and the best test performances are obtained by choosing the initial learning rate , for DC-ASGD-c, and , for DC-ASGD-a. We tried different numbers of local workers in our experiments: .

First, we investigate the learning curves with a fixed number of effective passes as shown in Figure 2. From the figure, we have the following observations: (1) Sequential SGD achieves the best accuracy, and its final test error is 8.65%. (2) The test errors of ASGD and SSGD increase with respect to the number of local workers. In particular, when , ASGD and SSGD achieve test errors of 9.27% and 9.17% respectively; and when , their test errors become 10.26% and 10.10% respectively. These results are reasonable: ASGD suffers from delayed gradients, which become more serious for a larger number of workers; SSGD increases the effective mini-batch size by times, and an enlarged mini-batch size usually affects the training performance of DNN. (3) For DC-ASGD, no matter which is used, its performance is significantly better than ASGD and SSGD, and catches up with sequential SGD. For example, when , the test error of DC-ASGD-c is 8.67%, which is indistinguishable from sequential SGD, and the test error for DC-ASGD-a is 8.19%, which is even better than that achieved by sequential SGD. It is not by design that DC-ASGD can beat sequential SGD. The test performance lift might be attributed to the regularization effect brought by the variance introduced by parallel training. When , DC-ASGD-c can reduce the test error to 9.27%, which is nearly 1% better than ASGD and SSGD; meanwhile the test error is 8.57% for DC-ASGD-a, which is again slightly better than sequential SGD.

We further compared the convergence speeds of different algorithms as shown in Figure 3. From this figure, we have the following observations: (1) Although its convergent point is not very good, ASGD indeed runs very fast, and achieves almost linear speed-up as compared to sequential SGD in terms of throughput. (2) SSGD also runs faster than sequential SGD. However, due to the synchronization barrier, it is significantly slower than ASGD. (3) DC-ASGD achieves a very good balance between accuracy and speed. On one hand, its convergence speed is very similar to that of ASGD (although it involves a little more computational cost and some memory cost when compensating for the delay). On the other hand, its convergent point is as good as, or even better than, that of sequential SGD. The experimental results clearly demonstrate the effectiveness of our proposed delay compensation technique.

### 6.2 Experimental Results on ImageNet

In order to further verify our method on the large-scale setting, we conducted the experiment on the ImageNet dataset, which contains 1.28 million training images and 50k validation images in 1000 categories. We trained a 50-layer ResNet model He et al. (2016) on this dataset.

According to the previous subsection, DC-ASGD-a seems to be better; therefore, in this large-scale experiment, we only implemented DC-ASGD-a. For all algorithms in this experiment, we performed training for 120 epochs, with a mini-batch size of 32, and an initial learning rate reduced by ten times after every 30 epochs following the practice in He et al. (2016). We did grid search for hyperparameter tuning and set the initial learning rate , , . Since the training on the ImageNet dataset is very time consuming, we employed GPU nodes in our experiments. The top-1 accuracies based on 1-crop testing of different algorithms are given in Figure 4.

According to the figure, we have the following observations: (1) After processing the same amount of training data, DC-ASGD always outperforms SSGD and ASGD. In particular, while the eventual test errors achieved by ASGD and SSGD were 25.64% and 25.30% respectively, DC-ASGD achieved a lower error rate of 25.18%. Please note that this time the accuracy of SSGD is quite good (which is consistent with a separate observation in Chen et al. (2016)). An explanation is that the training on ImageNet is less sensitive to the mini-batch size than that on CIFAR-10. (2) If we look at the learning curve with respect to wallclock time, SSGD is slowed down due to the synchronization barrier; ASGD and DC-ASGD have similar efficiency, once again indicating that the extra overhead for delay compensation introduced by DC-ASGD can almost be neglected in practice. Based on all our experiments, we can clearly see that DC-ASGD has outstanding performance in terms of both classification accuracy and convergence speed, which in turn verifies the soundness of our proposed delay compensation technique.

## 7 Conclusion

In this paper, we have given a theoretical analysis on the problem of delayed gradients in the asynchronous parallelization of stochastic gradient descent (SGD) algorithms, and proposed a novel algorithm called Delay Compensated Asynchronous SGD (DC-ASGD) to tackle the problem. We have evaluated DC-ASGD on CIFAR-10 and ImageNet datasets, and the results demonstrate that it can achieve better accuracy than both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD. As for the future work, we plan to test DC-ASGD on larger computer clusters, where with the increasing number of local workers, the delay will become more serious. Furthermore, we will investigate the economical approximation of higher-order items in the Taylor expansion to achieve more effective delay compensation.

## Appendix A Theorem 3.1 and Its Proof

Theorem 3.1:

Assume the loss function is $L_1$-Lipschitz. If $\lambda$ makes the following inequality hold,

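$$
\sum_{k=1}^{K}\frac{1}{\sigma_k^3(x,w_t)} \ge 2\left[C_{ij}\left(\sum_{k=1}^{K}\frac{1}{\sigma_k(x,w_t)}\right)^2 + C'_{ij}L_1^2|\epsilon_t|\right], \tag{15}
$$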

where , , and the model converges to the optimal model, then the MSE of is smaller than the MSE of in approximating Hessian .

Proof:

For simplicity, we abbreviate as , as and as . First, we calculate the MSE of , to approximate for each element of . We denote the element in the -th row and -th column of as and as .

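The MSE of $G^t_{ij}$:

$$
\mathbb{E}\left(G^t_{ij} - \mathbb{E}H^t_{ij}\right)^2 = \mathbb{E}\left(G^t_{ij} - \mathbb{E}G^t_{ij}\right)^2 + \left(\mathbb{E}H^t_{ij} - \mathbb{E}G^t_{ij}\right)^2 = \mathbb{E}\left(G^t_{ij}\right)^2 - \left(\mathbb{E}G^t_{ij}\right)^2 + \epsilon_t^2 \tag{16}
$$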

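The MSE of $\lambda G^t_{ij}$:

$$
\begin{aligned}
\mathbb{E}\left(\lambda G^t_{ij} - \mathbb{E}H^t_{ij}\right)^2
&= \lambda^2\,\mathbb{E}\left(G^t_{ij} - \mathbb{E}G^t_{ij}\right)^2 + \left(\mathbb{E}H^t_{ij} - \lambda\,\mathbb{E}G^t_{ij}\right)^2\\
&= \lambda^2\,\mathbb{E}\left(G^t_{ij}\right)^2 - \lambda^2\left(\mathbb{E}G^t_{ij}\right)^2 + (1-\lambda)^2\left(\mathbb{E}G^t_{ij}\right)^2 + \epsilon_t^2 + 2(\lambda-1)\,\mathbb{E}G^t_{ij}\,\epsilon_t
\end{aligned} \tag{17}
$$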

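The condition for the MSE of $\lambda G^t_{ij}$ to be no larger than the MSE of $G^t_{ij}$, obtained by subtracting Eqn. (17) from Eqn. (16), is

$$
(1-\lambda^2)\left(\mathbb{E}\left(G^t_{ij}\right)^2 - \left(\mathbb{E}G^t_{ij}\right)^2\right) \ge (1-\lambda)^2\left(\mathbb{E}G^t_{ij}\right)^2 + 2(\lambda-1)\,\mathbb{E}G^t_{ij}\,\epsilon_t \tag{18}
$$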

Inequality (18) is equivalent to

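$$
(1+\lambda)\,\mathbb{E}\left(G^t_{ij}\right)^2 \ge 2\left[\left(\mathbb{E}G^t_{ij}\right)^2 - \mathbb{E}G^t_{ij}\,\epsilon_t\right] \tag{19}
$$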
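The bias-variance trade-off behind Eqns. (16) and (17) can be illustrated with scalar numbers. The sketch below assumes a hypothetical estimator with mean 1, variance 4, and target value 1 (so the unscaled estimator is unbiased but noisy); the decomposition used is the first line of Eqn. (17), i.e. variance $\lambda^2\mathrm{Var}(G)$ plus squared bias $(\lambda\,\mathbb{E}G - \mathbb{E}H)^2$.

```python
def mse_scaled(lam, mean_g, var_g, h):
    """MSE of lam * G as an estimator of h: lam^2 * Var(G) plus squared bias."""
    return lam * lam * var_g + (lam * mean_g - h) ** 2

mean_g, var_g, h = 1.0, 4.0, 1.0                    # hypothetical moments and target

mse_plain = mse_scaled(1.0, mean_g, var_g, h)       # lam = 1: no bias, full variance
mse_shrunk = mse_scaled(0.5, mean_g, var_g, h)      # lam = 0.5: some bias, less variance

# Minimiser of the quadratic in lam, from setting its derivative to zero.
best_lam = mean_g * h / (var_g + mean_g ** 2)
mse_best = mse_scaled(best_lam, mean_g, var_g, h)
```

With these numbers the MSE falls from 4.0 at $\lambda=1$ to 1.25 at $\lambda=0.5$ and to 0.8 at the minimiser $\lambda=0.2$, showing how a scaled estimator can beat an unbiased one when the variance is large.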

Next we calculate , and which appear in Eqn.(19). For simplicity, we denote as , and as . Then we can get:

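$$
\begin{aligned}
\mathbb{E}(g_{ij})^2 &= \mathbb{E}_{(Y|x,w_t)}\left(\frac{\partial}{\partial w_i}\log P(Y|x,w_t)\right)^2\left(\frac{\partial}{\partial w_j}\log P(Y|x,w_t)\right)^2 && \text{(20)}\\
&\ge \mathbb{E}_{(Y|x,w^*)}\left(\sum_{k=1}^{K}\left(-\frac{z_k}{\sigma_k}\right)\right)^4 (l_i l_j)^2\\
&= \alpha\,(l_i l_j)^2\left(\sum_{k=1}^{K}\frac{1}{\sigma_k^3(x,w_t)}\right) && \text{(21)}\\
(\mathbb{E}h_{ij})^2 &= \left(\mathbb{E}_{(Y|x,w^*)}\sum_{k=1}^{K}\frac{\partial\sigma_k}{\partial w_i}\left(-\frac{z_k}{\sigma_k}\right)\cdot\sum_{k=1}^{K}\frac{\partial\sigma_k}{\partial w_j}\left(-\frac{z_k}{\sigma_k}\right)\right)^2\\
&\le \beta^2 (u_i u_j)^2\left(\sum_{k=1}^{K}\frac{1}{\sigma_k(x,w_t)}\right)^2. && \text{(22)}
\end{aligned}
$$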

By substituting Ineq.(21) and Ineq.(22) into Ineq.(19), a sufficient condition for Ineq.(19) to be satisfied is because .

## Appendix B Corollary 3.2 and Its Proof

Corollary 3.2: A sufficient condition for inequality (15) is and such that .

Proof:
Denote and . If such that , we have for . Therefore

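$$
\begin{aligned}
F(\sigma_1,\dots,\sigma_K)
&\ge \frac{1}{(\sigma_{k_1})^3} + \frac{K-1}{\Delta^3} - 2C_{ij}\left(\frac{1}{\sigma_{k_1}} + \frac{K-1}{\Delta}\right)^2 - 2C'_{ij}L_1^2|\epsilon_t| && \text{(23)}\\
&\ge \frac{K-1}{\Delta^3} - 2C_{ij}\left(\left(\frac{K-1}{\Delta}\right)^2 + \frac{1}{\sigma_{k_1}^2} + \frac{2(K-1)}{\sigma_{k_1}\Delta}\right) - 2C'_{ij}L_1^2|\epsilon_t| && \text{(24)}\\
&\ge \frac{K-1}{\Delta^3} - 2C_{ij}\left(\frac{(K-1)^2}{\Delta^2} + \frac{2K-1}{\sigma_{k_1}\Delta}\right) - 2C'_{ij}L_1^2|\epsilon_t| && \text{(25)}\\
&= \frac{1}{\Delta}\left(\frac{K-1}{\Delta^2} - 2C_{ij}\left(\frac{(K-1)^2}{\Delta} + \frac{2K-1}{\sigma_{k_1}}\right)\right) - 2C'_{ij}L_1^2|\epsilon_t| && \text{(26)}\\
&\ge \frac{1}{\Delta}\left(\frac{K-1}{\Delta^2} - 2C_{ij}\,\frac{(K-1)^2 + 2K-1}{\Delta}\right) - 2C'_{ij}L_1^2|\epsilon_t| && \text{(27)}\\
&\ge \frac{1}{\Delta^2}\left(\frac{K-1}{\Delta} - 2C_{ij}K^2 - 2C'_{ij}L_1^2|\epsilon_t|\right) && \text{(28)}\\
&= 0 && \text{(29)}
\end{aligned}
$$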

where Ineq.(25) and (27) is established since ; and Eqn.(29) is established by putting in Eqn.(28).

## Appendix C Uniform upper bound of MSE

###### Lemma C.1

Assume the loss function is -Lipschitz, and the diagonalization error of Hessian is upper bounded by , i.e., , 12 then we have, for ,

$$
\mathrm{mse}^t(\mathrm{Diag}(\lambda G)) \le 4\lambda^2 V_1 + 4(1-\lambda)^2 L_1^4 + 4\epsilon_t^2 + 4\epsilon_D, \tag{30}
$$

where $V_1$ is the upper bound of the variance of $G(w_t)$.

Proof:

 mset(Diag(λG)) (31) ≤ E∥Diag(λG(wt))−H(wt)∥2 (32) ≤ 4E∥Diag(λG(wt))−E(Diag(λ