# Communication-Efficient Local Decentralized SGD Methods

## Abstract

Local updates have recently become a powerful tool for improving communication efficiency in centralized settings via periodic communication. For decentralized settings, it is still unclear how to efficiently combine local updates and decentralized communication. In this work, we propose an algorithm named LD-SGD, which incorporates arbitrary update schemes that alternate between multiple Local updates and multiple Decentralized SGDs, and provide an analytical framework for LD-SGD. Under the framework, we present a sufficient condition to guarantee the convergence. We show that LD-SGD converges to a critical point for a wide range of update schemes when the objective is non-convex and the training data are not independent and identically distributed (non-iid). Moreover, our framework brings many insights into the design of update schemes for decentralized optimization. As examples, we specify two update schemes and show how they help improve communication efficiency. Specifically, the first scheme alternates between a number of local update steps and a number of decentralized SGD steps. From our analysis, the ratio of the number of local updates to that of decentralized SGD trades off communication and computation. The second scheme periodically shrinks the length of local updates. We show that the decaying strategy helps improve communication efficiency both theoretically and empirically.

## 1 Introduction

We study distributed optimization where the data are partitioned among $n$ worker nodes; the data are not necessarily identically distributed. We seek to learn the model parameter (aka optimization variable) $x \in \mathbb{R}^d$ by solving the following distributed empirical risk minimization problem:

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{k=1}^{n} f_k(x), \tag{1}$$

where $f_k(x) := \mathbb{E}_{\xi \sim \mathcal{D}_k}[F_k(x; \xi)]$ and $\mathcal{D}_k$ is the distribution of data on the $k$-th node with $k \in [n]$. Such a problem is traditionally solved under centralized optimization paradigms such as parameter servers Li et al. (2014). Federated Learning (FL), which often has a central parameter server, enables massive edge computing devices to jointly learn a centralized model while keeping all local data localized Konečný et al. (2015); McMahan et al. (2017); Konečný (2017); Li et al. (2019a); Sahu et al. (2019). As opposed to centralized optimization, decentralized optimization lets every worker node collaborate only with its neighbors by exchanging information. A typical decentralized algorithm works in this way: a node collects its neighbors' model parameters, takes a weighted average, and then performs a (stochastic) gradient descent step to update its local parameters Lian et al. (2017). Decentralized optimization can outperform its centralized counterpart under specific settings Lian et al. (2017).

Decentralized optimization, like its centralized counterpart, suffers from high communication costs. The communication cost is the bottleneck of distributed optimization when the number of model parameters or the number of worker nodes is large. It is well known that deep neural networks have a large number of parameters. For example, ResNet-50 He et al. (2016) has 25 million parameters, so sending them through a computer network can be expensive and time-consuming. With modern big data and big models, a large number of worker nodes can be involved in distributed optimization, which further increases the communication cost. The situation is exacerbated if the worker nodes are remotely connected, which is the case in edge computing and other types of distributed learning.

In recent years, to directly save communication, many researchers let more local updates happen before each synchronization in centralized settings. A typical and famous example is Local SGD McMahan et al. (2017); Lin et al. (2018); Stich (2018); Wang and Joshi (2018b, a). As its decentralized counterpart, Periodic Decentralized SGD (PD-SGD) alternates between a fixed number of local updates and one step of decentralized SGD Wang and Joshi (2018b). However, its update scheme is too rigid to balance the trade-off between communication and computation efficiently Wang et al. (2019). It is still unclear how to combine local updates and decentralized communications efficiently in decentralized and non-iid data settings.

Contributions. To answer this question, we propose a meta algorithm termed LD-SGD, which can incorporate arbitrary update schemes for decentralized optimization. We provide an analytical framework that sheds light on the relationship between convergence and update schemes. We show that LD-SGD converges with a wide choice of communication patterns for non-convex stochastic optimization problems and non-identically distributed training data (i.e., $\mathcal{D}_1, \cdots, \mathcal{D}_n$ are not the same).

We then specify two update schemes to illustrate the effectiveness of LD-SGD. For the first scheme, we let LD-SGD alternate between multiple (i.e., $I_1$ steps of) local updates and multiple (i.e., $I_2$ steps of) decentralized SGDs; see the illustration in Figure 1. A reasonable choice of $I_2$ can better balance communication and computation, both theoretically and empirically.

We observe that more local computation (i.e., large $I_1/I_2$) often leads to a quick initial drop of the loss function but higher final errors, while more communication (i.e., small $I_1/I_2$) often results in a lower error floor but higher communication cost. Therefore, in the second scheme, we propose and analyze a decaying strategy that periodically halves $I_1$. Within our framework, we theoretically verify the efficiency of the strategy.

## 2 Related Work

Federated optimization. The optimization problem implicit in FL is referred to as federated optimization, drawing a connection (and contrast) to distributed optimization. Currently, the state-of-the-art algorithm in federated optimization is Federated Averaging (FedAvg) Konečný et al. (2016); McMahan et al. (2017), which is a centralized optimization method and is also referred to as Local SGD. In every iteration of FedAvg, a small subset of nodes is activated; each activated node runs multiple local SGD steps and then sends its updated parameters to the central server. Under perhaps unrealistic assumptions such as identical $\mathcal{D}_k$'s or all nodes being activated, the convergence of FedAvg has been analyzed by Zhou and Cong (2017); Stich (2018); Wang and Joshi (2018b); Yu et al. (2019b). Li et al. (2019b) analyzed FedAvg for the first time in a more realistic federated setting (i.e., non-identical $\mathcal{D}_k$'s and partially activated nodes). PD-SGD is an extension of FedAvg (or Local SGD) towards decentralized optimization Wang and Joshi (2018b); Haddadpour and Mahdavi (2019). MATCHA Wang et al. (2019) extends PD-SGD to a more federated setting by only activating a random subgraph of the network topology each round. Our work can be viewed as an attempt to generalize FedAvg to decentralized settings.

Decentralized stochastic gradient descent (D-SGD). Decentralized (stochastic) algorithms were used as compromises when a powerful central server is not available. They were studied as consensus optimization in the control community Ram et al. (2010); Yuan et al. (2016); Sirb and Ye (2016). Lian et al. (2017) justified the potential advantage of D-SGD over its centralized counterpart. D-SGD not only reduces the communication cost but achieves the same linear speed-up as centralized counterparts when more nodes are available Lian et al. (2017). This promising result pushes the research of distributed optimization from a sheer centralized mechanism to a more decentralized pattern Lan et al. (2017); Tang et al. (2018b); Koloskova et al. (2019); Wang et al. (2019); Luo et al. (2019).

Communication efficient algorithms. The current methodology towards communication-efficiency in distributed optimization could be divided into two categories. The more direct approach is to reduce the size of the messages through gradient compression or sparsification Seide et al. (2014); Lin et al. (2017b); Zhang et al. (2017); Tang et al. (2018a); Wang et al. (2018); Horváth et al. (2019). An orthogonal one is to pay more local computation for less communication, e.g., one-shot aggregation Zhang et al. (2013, 2015); Lee et al. (2017); Lin et al. (2017a); Wang (2019), primal-dual algorithms Smith et al. (2016, 2017); Hong et al. (2018) and distributed Newton methods Shamir et al. (2014); Zhang and Lin (2015); Reddi et al. (2016); Shusen Wang et al. (2018); Mahajan et al. (2018). Beyond them, a simple but powerful method is to reduce the communication frequency by allowing more local updates Zinkevich et al. (2010); Stich (2018); Lin et al. (2018); You et al. (2018); Wang and Joshi (2018b), which we focus on in this paper.

Most related work. Our work is most closely related to Wang and Joshi (2018b); Wang et al. (2019). Specifically, Wang and Joshi (2018b) proposed PD-SGD, which also combines decentralization and local updates. However, they only considered the case of one step of decentralized SGD after a fixed number of local updates. Moreover, they analyzed PD-SGD by assuming all worker nodes have access to the underlying distribution (hence data are identically distributed). The algorithm shown in Figure 1 extends PD-SGD by introducing a new parameter (i.e., $I_2$) that controls the length of decentralized SGDs. $I_2$ helps us better balance the communication-computation trade-off.

MATCHA Wang et al. (2019) makes communication happen only among a random small portion of worker nodes at each round. When no node is activated, local updates come in. Consequently, the theory of MATCHA is formulated for random connection matrices (i.e., $W_t$ in our case) and does not straightforwardly extend to a deterministic sequence of $W_t$. Our work mainly studies a deterministic sequence of $W_t$ but could also extend to random sequences.

## 3 Notation and Preliminaries

Decentralized system. Conventionally, a decentralized system can be described by a graph $\mathcal{G} = ([n], W)$, where $W$ is an $n \times n$ doubly stochastic matrix describing the weights of the edges. A nonzero entry $w_{ij}$ indicates that the $i$-th and $j$-th nodes are connected.

###### Definition 1.

We say a matrix $W = [w_{ij}] \in \mathbb{R}^{n \times n}$ is symmetric and doubly stochastic if $W$ is symmetric and each row of $W$ is a probability distribution over the vertex set $[n]$, i.e., $w_{ij} \ge 0$, $W = W^\top$, and $W \mathbf{1}_n = \mathbf{1}_n$.

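Notation. Let $x^{(k)} \in \mathbb{R}^d$ be the optimization variable (aka model parameters in machine learning language) held by the $k$-th node. The step is indicated by a subscript, e.g., $x_t^{(k)}$ is the parameter held by the $k$-th node in step $t$. Note that at any moment, $x^{(1)}, \cdots, x^{(n)}$ may not be equal. The concatenation of all the variables is

$$X := [x^{(1)}, \cdots, x^{(n)}] \in \mathbb{R}^{d \times n}.$$

The averaged variable is $\bar{x} := \frac{1}{n} \sum_{k=1}^{n} x^{(k)} = \frac{1}{n} X \mathbf{1}_n$. The derivative of $F_k$ w.r.t. $x$ is $\nabla F_k(x; \xi)$ and the concatenated gradient evaluated at $X$ with datum $\xi$ is

$$G(X; \xi) := [\nabla F_1(x^{(1)}; \xi^{(1)}), \cdots, \nabla F_n(x^{(n)}; \xi^{(n)})].$$

We denote the set of natural numbers by $\mathbb{N} = \{1, 2, \cdots\}$. We define $[n] := \{1, 2, \cdots, n\}$, and $[s:t]$ means the interval between the positive integers $s$ and $t$, i.e., if $s \le t$, $[s:t] = \{l \in \mathbb{N} : s \le l \le t\}$, otherwise $[s:t] = \emptyset$. For any set $\mathcal{I}$ and real number $x$, we define $\mathcal{I} + x := \{y + x : y \in \mathcal{I}\}$.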

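Decentralized SGD (D-SGD). D-SGD works in the following way Bianchi et al. (2013); Lan et al. (2017). At Step $t$, the $k$-th node randomly chooses a local datum $\xi_t^{(k)}$, and uses its current local variable $x_t^{(k)}$ to evaluate the stochastic gradient $\nabla F_k(x_t^{(k)}; \xi_t^{(k)})$. Then each node performs stochastic gradient descent (SGD) to obtain an intermediate variable $x_{t+\frac{1}{2}}^{(k)}$ and finally finishes the update by collecting and aggregating its neighbors' intermediate variables:

$$\begin{aligned} x_{t+\frac{1}{2}}^{(k)} &\longleftarrow x_t^{(k)} - \eta \nabla F_k(x_t^{(k)}; \xi_t^{(k)}), && (2) \\ x_{t+1}^{(k)} &\longleftarrow \sum_{l \in \mathcal{N}_k} w_{kl}\, x_{t+\frac{1}{2}}^{(l)}, && (3) \end{aligned}$$

where $\mathcal{N}_k = \{l \in [n] : w_{kl} > 0\}$ contains the indices of the $k$-th node's neighbors. In matrix form, this can be captured by $X_{t+1} = (X_t - \eta\, G(X_t; \xi_t)) W$. Obviously, D-SGD requires $T$ communications per $T$ steps.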
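The two-step update (2)-(3) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `dsgd_step` and `ring_matrix` and the ring topology are assumptions made for the example, not part of the paper, and the gradients are passed in rather than sampled.

```python
def dsgd_step(X, grads, W, eta):
    """One D-SGD step: local gradient step (2), then neighbor averaging (3).

    X, grads: lists of n vectors (each a list of d floats), one per node.
    W: n x n doubly stochastic mixing matrix given as a list of rows.
    """
    n, d = len(X), len(X[0])
    # Step (2): intermediate variable x_{t+1/2}^{(k)} = x_t^{(k)} - eta * grad_k
    half = [[X[k][i] - eta * grads[k][i] for i in range(d)] for k in range(n)]
    # Step (3): x_{t+1}^{(k)} = sum_l w_{kl} * x_{t+1/2}^{(l)}
    return [[sum(W[k][l] * half[l][i] for l in range(n)) for i in range(d)]
            for k in range(n)]


def ring_matrix(n):
    """Hypothetical mixing matrix: each node averages itself and its two ring neighbors."""
    W = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in (k - 1, k, k + 1):
            W[k][l % n] += 1.0 / 3.0
    return W


# Four nodes holding scalar parameters, all with gradient 1.
X = [[0.0], [3.0], [6.0], [9.0]]
grads = [[1.0], [1.0], [1.0], [1.0]]
X_next = dsgd_step(X, grads, ring_matrix(4), eta=0.5)
```

Because the mixing matrix is doubly stochastic, the averaging step leaves the node average unchanged, so the averaged variable moves exactly as one SGD step with the averaged gradient.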

###### Remark 1.

The order of Step 2 and Step 3 in D-SGD can be exchanged. In this way, we first average the local variable with neighbors and then apply the local stochastic gradient to the local variable. The update rule becomes $x_{t+1}^{(k)} = \sum_{l \in \mathcal{N}_k} w_{kl}\, x_t^{(l)} - \eta \nabla F_k(x_t^{(k)}; \xi_t^{(k)})$. The benefit is that the computation of stochastic gradients (i.e., $\nabla F_k(x_t^{(k)}; \xi_t^{(k)})$) and communication (i.e., Step 3) can be run in parallel. Our theory in Section 4.2 is applicable to these cases.

## 4 Local Decentralized SGD (LD-SGD)

In this section, we first propose LD-SGD which is able to incorporate arbitrary update schemes. Then we present convergence analysis for it.

### 4.1 The Algorithm

Algorithm 1 summarizes the proposed LD-SGD algorithm. It periodically performs multiple local updates and D-SGD. Without local updates, LD-SGD would be the standard D-SGD algorithm. Let $\mathcal{I}_T$ index the steps where decentralized SGD is performed. Our theory allows for arbitrary $\mathcal{I}_T \subseteq [T]$. We can write the resulting algorithm in matrix form:

$$X_{t+1} = (X_t - \eta\, G(X_t; \xi_t))\, W_t,$$

where $W_t \in \mathbb{R}^{n \times n}$ is the connected matrix defined by

$$W_t = \begin{cases} I_n & \text{if } t \notin \mathcal{I}_T, \\ W & \text{if } t \in \mathcal{I}_T. \end{cases} \tag{4}$$

Here $W$ is a prespecified doubly stochastic matrix. Different choices of $\mathcal{I}_T$ give rise to different update schemes and thus to different communication efficiency. For example, when we choose $\mathcal{I}_T = \{t \in [T] : t \bmod I = 0\}$, where $I$ is the communication interval, LD-SGD recovers the previous work PD-SGD Wang and Joshi (2018b). Therefore, it is natural to explore how different $\mathcal{I}_T$ affect the convergence of LD-SGD.
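A minimal sketch of the LD-SGD loop may help make the role of the communication set concrete. It assumes toy scalar objectives $f_k(x) = \frac{1}{2}(x - c_k)^2$ with exact gradients (no sampling noise), and the names `ld_sgd`, `c`, and `comm_steps` are hypothetical; `comm_steps` stands in for the set of steps at which decentralized SGD is performed.

```python
def ld_sgd(c, W, comm_steps, T, eta):
    """Run LD-SGD for T steps on toy objectives f_k(x) = 0.5 * (x - c[k])**2.

    comm_steps: set of steps t in 1..T at which W_t = W (gradient step followed
    by mixing); at every other step W_t = I_n, i.e. a purely local update.
    Exact gradients replace stochastic ones (a simplification for the sketch).
    """
    n = len(c)
    x = [0.0] * n  # identical initialization on every node
    for t in range(1, T + 1):
        # Local gradient step on every node: grad f_k(x) = x - c[k].
        x = [x[k] - eta * (x[k] - c[k]) for k in range(n)]
        # Mixing with neighbors only when t is a communication step.
        if t in comm_steps:
            x = [sum(W[k][l] * x[l] for l in range(n)) for k in range(n)]
    return x


n = 4
uniform = [[1.0 / n] * n for _ in range(n)]  # complete graph with equal weights
every_fifth = {t for t in range(1, 201) if t % 5 == 0}
x_final = ld_sgd([0.0, 1.0, 2.0, 3.0], uniform, every_fifth, T=200, eta=0.1)
```

With `comm_steps` containing every fifth step, the loop behaves like PD-SGD with communication interval 5, while an empty set gives purely local training in which each node drifts to its own minimizer `c[k]`.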

###### Remark 2.

Actually, our framework applies to any deterministic sequence of doubly stochastic matrices, but this would complicate the presentation of the conclusions. For brevity, we use the same $W$ to define $W_t$ for all $t \in \mathcal{I}_T$.

### 4.2 Convergence Analysis

#### Assumptions

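In Eq. (1), we define $f_k(x) := \mathbb{E}_{\xi \sim \mathcal{D}_k}[F_k(x; \xi)]$ as the objective function of the $k$-th node. Here, $x$ is the optimization variable and $\xi$ is a data sample. Note that $f_k$ captures the data distribution in the $k$-th node. We make a standard assumption: $f_1, \cdots, f_n$ are smooth.

###### Assumption 1 (Smoothness).

For all $k \in [n]$, $f_k$ is smooth with modulus $L$, i.e.,

$$\|\nabla f_k(x) - \nabla f_k(y)\| \le L \|x - y\|, \quad \forall\, x, y \in \mathbb{R}^d.$$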

We assume the stochastic gradients have bounded variance. The assumption has been made by the prior work Lian et al. (2017); Wang and Joshi (2018b); Tang et al. (2018b, a).

###### Assumption 2 (Bounded variance).

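There exists some $\sigma > 0$ such that for all $k \in [n]$,

$$\mathbb{E}_{\xi \sim \mathcal{D}_k} \|\nabla F_k(x; \xi) - \nabla f_k(x)\|^2 \le \sigma^2, \quad \forall\, x \in \mathbb{R}^d.$$

Recall from Eq. (1) that $f(x) = \frac{1}{n} \sum_{k=1}^{n} f_k(x)$ is the global objective function. If the data distributions are not identical, that is, $\mathcal{D}_k \neq \mathcal{D}_l$ for $k \neq l$, then the global objective is not the same as the local objectives. In this case, we define $\kappa$ to quantify the degree of non-iid. If the data across nodes are iid, then $\kappa = 0$.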

###### Assumption 3 (Degree of non-iid).

There exists some such that

Finally, we need to assume the nodes are well connected; otherwise, the update in one node cannot be propagated to another node within a few iterations. In the worst case, if the graph is not connected, the algorithm will not converge. We use $\rho$ to quantify the connectivity, where $\rho$ is the second largest absolute eigenvalue of $W$. A small $\rho$ indicates nice connectivity. If the connection forms a complete graph, then $W = \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$, and thus $\rho = 0$.

###### Assumption 4 (Nice connectivity).

The connectivity matrix $W$ is symmetric doubly stochastic. Denote its eigenvalues by $1 = |\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n| \ge 0$. We assume the spectral gap $1 - \rho \in (0, 1]$, where $\rho = |\lambda_2| \in [0, 1)$.
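To get a feel for how the topology drives the connectivity constant, the sketch below estimates the second largest absolute eigenvalue of a symmetric doubly stochastic matrix by power iteration on the matrix with its averaging component removed. The helper names `spectral_rho` and `ring_matrix`, the ring topology, and the choice of power iteration are all assumptions of this example rather than anything prescribed by the paper.

```python
import math


def ring_matrix(n):
    """Hypothetical mixing matrix: each node averages itself and its two ring neighbors."""
    W = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in (k - 1, k, k + 1):
            W[k][l % n] += 1.0 / 3.0
    return W


def spectral_rho(W, iters=500):
    """Estimate the second largest absolute eigenvalue of a symmetric doubly stochastic W."""
    n = len(W)
    # Subtracting (1/n) * ones * ones^T removes the eigenvalue 1 of W.
    M = [[W[i][j] - 1.0 / n for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    rho = 0.0
    for _ in range(iters):
        norm_v = math.sqrt(sum(a * a for a in v))
        if norm_v == 0.0:
            return 0.0  # W is exactly the averaging matrix
        v = [a / norm_v for a in v]
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        # For symmetric M, the norm ratio converges to the largest |eigenvalue|.
        rho = math.sqrt(sum(a * a for a in v))
    return rho


rho_ring = spectral_rho(ring_matrix(6))               # sparse ring
rho_complete = spectral_rho([[1.0 / 5] * 5] * 5)      # complete graph, equal weights
```

Under these assumptions a complete graph with equal weights gives a value of zero, while sparser rings give values closer to one, matching the intuition that a small value indicates good connectivity.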

#### Main Results

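Recall that $\bar{x}_t$ is defined as the averaged variable in the $t$-th iteration. Note that the objective function $f$ is often non-convex when neural networks are applied. The quantity $\rho_{s,t-1}$ is very important in our theory because it captures the characteristics of each update scheme. All proofs can be found in Appendix A.

###### Definition 2.

For any $s < t$, define $\rho_{s,t-1} = \|\boldsymbol{\Phi}_{s,t-1} - Q\|$, where $\boldsymbol{\Phi}_{s,t-1} = \prod_{l=s}^{t-1} W_l$ with $W_l$ given in (4) and $Q = \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$. Actually, we have $\rho_{s,t-1} = \rho^{|[s:t-1] \cap \mathcal{I}_T|}$, where $[s:t-1] = \{l \in \mathbb{N} : s \le l \le t-1\}$ and $\rho$ is defined in Assumption 4.

###### Theorem 1 (LD-SGD with any $\mathcal{I}_T$).

Let Assumptions 1, 2, 3, and 4 hold and the constants $L$, $\sigma$, $\kappa$, and $\rho$ be defined therein. Let $\Delta$ denote the initial error. For any fixed $T$, define

$$A_T = \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{t-1} \rho_{s,t-1}^2, \quad B_T = \frac{1}{T} \sum_{t=1}^{T} \Big( \sum_{s=1}^{t-1} \rho_{s,t-1} \Big)^2, \quad C_T = \max_{s \in [T-1]} \sum_{t=s+1}^{T} \rho_{s,t-1} \Big( \sum_{l=1}^{t-1} \rho_{l,t-1} \Big).$$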

If the learning rate is small enough such that

 η

then

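$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E} \|\nabla f(\bar{x}_t)\|^2 \le \underbrace{\frac{2\Delta}{\eta T} + \frac{\eta L \sigma^2}{n}}_{\text{fully sync SGD}} + \underbrace{4 \eta^2 L^2 (A_T \sigma^2 + B_T \kappa^2)}_{\text{residual error}}. \tag{6}$$
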
###### Corollary 1.

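If we choose the learning rate as $\eta = \sqrt{n/T}$ in Theorem 1, then, when $T$ is large enough for this $\eta$ to satisfy the learning-rate condition, we have

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E} \|\nabla f(\bar{x}_t)\|^2 \le \frac{2\Delta + L\sigma^2}{\sqrt{nT}} + \frac{4 n L^2 (A_T \sigma^2 + B_T \kappa^2)}{T}. \tag{7}$$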
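As a quick sanity check on (7), the substitution can be carried out term by term. The sketch below assumes the learning rate $\eta = \sqrt{n/T}$, which is the choice that reproduces the three terms of (7) from the bound (6):

```latex
\begin{aligned}
\frac{2\Delta}{\eta T} &= \frac{2\Delta}{\sqrt{n/T}\; T} = \frac{2\Delta}{\sqrt{nT}}, \\
\frac{\eta L \sigma^2}{n} &= \frac{\sqrt{n/T}\; L \sigma^2}{n} = \frac{L \sigma^2}{\sqrt{nT}}, \\
4 \eta^2 L^2 (A_T \sigma^2 + B_T \kappa^2) &= \frac{4 n L^2 (A_T \sigma^2 + B_T \kappa^2)}{T}.
\end{aligned}
```

Summing the three lines gives the right-hand side of (7): the first two terms decay as $1/\sqrt{nT}$, while the residual term vanishes only if $A_T$ and $B_T$ grow sublinearly in $T$.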

Sufficient condition for convergence. If the chosen $\mathcal{I}_T$ satisfies the sublinear condition that

$$A_T = o(T), \quad B_T = o(T) \quad \text{and} \quad C_T = o(T), \tag{8}$$

we thereby prove the convergence of LD-SGD with the update scheme $\mathcal{I}_T$ to a stationary point, e.g., a local minimum or saddle point. However, not every update scheme satisfies (8). For example, when $\mathcal{I}_T = \emptyset$, we have $\rho_{s,t-1} = 1$ for all $s < t$ and thus $A_T = \Theta(T)$ and $B_T = \Theta(T^2)$. But Theorem 2 shows that as long as $\mathrm{gap}(\mathcal{I}_T)$ is small enough, the sublinear condition holds. So there is still a wide range of $\mathcal{I}_T$ that meets the condition.

###### Definition 3 (Gap).

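For any set $\mathcal{I}_T = \{e_1, \cdots, e_g\} \subseteq [T]$ with $e_i < e_j$ for $i < j$, the gap of $\mathcal{I}_T$ is defined as

$$\mathrm{gap}(\mathcal{I}_T) = \max_{i \in [g+1]} (e_i - e_{i-1}), \quad \text{where } e_0 = 0,\ e_{g+1} = T.$$

###### Theorem 2.

Let $A_T$, $B_T$, $C_T$ be defined in Theorem 1. Then for any $\mathcal{I}_T$, we have

$$A_T \le \mathrm{gap}(\mathcal{I}_T)\, \frac{1 + \rho^2}{1 - \rho^2}, \qquad \max\{B_T, C_T\} \le \frac{\mathrm{gap}(\mathcal{I}_T)^2}{(1 - \rho)^2}.$$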
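The constants of Theorem 1 and the gap can be evaluated by brute force for small horizons, which gives a cheap numerical check of the bounds above. The sketch assumes the closed form in which the quantity for a window equals the connectivity constant raised to the number of communication steps inside that window; the function names `rho_between`, `scheme_constants`, and `gap` are hypothetical.

```python
def rho_between(s, t_minus_1, comm_steps, rho):
    """rho ** (number of communication steps l with s <= l <= t-1)."""
    count = sum(1 for l in range(s, t_minus_1 + 1) if l in comm_steps)
    return rho ** count


def scheme_constants(comm_steps, T, rho):
    """Return (A_T, B_T, C_T) by direct summation; cubic cost, so small T >= 2 only."""
    r = lambda s, t: rho_between(s, t - 1, comm_steps, rho)
    # row[t] = sum over s = 1..t-1 of rho_{s,t-1}
    row = {t: sum(r(s, t) for s in range(1, t)) for t in range(1, T + 1)}
    A = sum(r(s, t) ** 2 for t in range(1, T + 1) for s in range(1, t)) / T
    B = sum(row[t] ** 2 for t in range(1, T + 1)) / T
    C = max(sum(r(s, t) * row[t] for t in range(s + 1, T + 1)) for s in range(1, T))
    return A, B, C


def gap(comm_steps, T):
    """Largest distance between consecutive communication steps, with e_0 = 0 and e_{g+1} = T."""
    pts = [0] + sorted(comm_steps) + [T]
    return max(b - a for a, b in zip(pts, pts[1:]))


T, rho = 60, 0.5
comm = {t for t in range(1, T + 1) if t % 5 == 0}
A, B, C = scheme_constants(comm, T, rho)
g = gap(comm, T)
bound_A = g * (1 + rho ** 2) / (1 - rho ** 2)
bound_BC = g ** 2 / (1 - rho) ** 2
```

For the empty communication set the same functions return constants that grow with the horizon, which illustrates why purely local training violates the sublinear condition.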

## 5 Two Proposed Update Schemes

Before we move to the discussion of our results, we first specify two classes of update schemes, both of which satisfy the sublinear condition (8). The proposed update schemes also deepen our understandings of the main results.

### 5.1 Adding Multiple Decentralized SGDs

One centralized average can synchronize all local models, while it often takes multiple rounds of decentralized communication to achieve global consensus. Therefore, it is natural to introduce multiple decentralized SGDs (D-SGD) to $\mathcal{I}_T$. In particular, we set

$$\mathcal{I}_T^1 = \{t \in [T] : t \bmod (I_1 + I_2) \notin [I_1]\},$$

where $I_1, I_2$ are parameters that respectively control the lengths of local updates and D-SGD.

In this way, each worker node periodically alternates between two phases. In the first phase, each node locally runs $I_1$ steps of SGD in parallel. In the second phase, each worker node runs $I_2$ steps of D-SGD. As mentioned, D-SGD is a combination of Step 2 and Step 3, so communication only happens in the second phase; a worker node performs $\frac{I_2}{I_1 + I_2} T$ communications per $T$ steps. Figure 1 illustrates one round of LD-SGD with $\mathcal{I}_T^1$. Theorem 3 bounds the corresponding $A_T$, $B_T$, and $C_T$; the proof is provided in Appendix B.
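The first update scheme is easy to enumerate directly from its set definition. The following sketch uses a hypothetical helper `scheme_one`; it returns the steps at which D-SGD (and hence communication) takes place.

```python
def scheme_one(T, I1, I2):
    """Steps t in 1..T with t mod (I1 + I2) not in {1, ..., I1}.

    Each period of I1 + I2 steps starts with I1 local steps and ends with
    I2 D-SGD steps; the returned set contains the D-SGD steps.
    """
    period = I1 + I2
    return {t for t in range(1, T + 1) if not (1 <= t % period <= I1)}


comm = scheme_one(T=10, I1=3, I2=2)   # {4, 5, 9, 10}
all_steps = scheme_one(T=5, I1=0, I2=1)  # no local phase: every step communicates
```

With three local steps and two D-SGD steps per period, two out of every five steps communicate, in line with the communication count stated above.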

###### Theorem 3 (LD-SGD with $\mathcal{I}_T^1$).

When we set $\mathcal{I}_T = \mathcal{I}_T^1$ for LD-SGD, under the same setting, Theorem 1 holds with

$$A_T \le \frac{1}{2I} \left( \frac{1 + \rho^{2 I_2}}{1 - \rho^{2 I_2}} I_1^2 + \frac{1 + \rho^2}{1 - \rho^2} I_1 \right) + \frac{\rho^2}{1 - \rho^2}, \qquad \max\{B_T, C_T\} \le K^2, \qquad K = \frac{I_1}{1 - \rho^{I_2}} + \frac{\rho}{1 - \rho},$$

where $I = I_1 + I_2$ is the period length. Therefore, LD-SGD converges with $\mathcal{I}_T^1$.

The introduction of $I_2$ extends the scope of the previous framework, Cooperative SGD Wang and Joshi (2018b). As a result, many existing algorithms become special cases when the period lengths $I_1$, $I_2$ and the connected matrix $W$ are carefully determined. As an evident example, we recover D-SGD by setting $I_1 = 0$ (no local updates) and the conventional PD-SGD by setting $I_2 = 1$ (with $I_1 \ge 1$). Another important example is Local SGD (or FedAvg), which periodically averages local model parameters in a centralized manner Zhou and Cong (2017); Lin et al. (2018); Stich (2018); Yu et al. (2019b). Local SGD is the case with $I_2 = 1$ and $W = \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$. We summarize examples and the comparison with their convergence results in Appendix E.

### 5.2 Decaying the Length of Local Updates

From Figure 2, a larger local computation ratio (i.e., $I_1/I_2$) converges faster but incurs a higher final error, while a lower local computation ratio enjoys a smaller final error but sacrifices convergence speed. A similar phenomenon is observed in independent work Wang and Joshi (2018a), which finds that a faster initial drop of the global loss often accompanies a higher final error.

To take advantage of both cases, we are inspired to decay $I_1$ every $M$ rounds until $I_1$ vanishes. In particular, we first define an ancillary set

$$\mathcal{I}(I_1, I_2, M) = \{t \in [M(I_1 + I_2)] : t \bmod (I_1 + I_2) \notin [I_1]\},$$

and then recursively define and

where returns the maximum number collected in and . Finally we set

$$\mathcal{I}_T^2 = \bigcup_{j=0}^{J} \mathcal{J}_j \cup [\max(\mathcal{J}_J) : T].$$

From the recursive definition, once $t > \max(\mathcal{J}_J)$, $I_1$ is reduced to zero and LD-SGD reduces to D-SGD.
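The decaying scheme can be sketched as a loop over phases of $M$ rounds each. Several details here are assumptions of the sketch rather than statements from the paper: the local length is halved with integer division after every phase, the phases stop once it reaches zero, each phase is shifted to start right after the previous one ends, and the helper names `ancillary` and `scheme_two` are hypothetical.

```python
def ancillary(I1, I2, M):
    """D-SGD steps within M rounds of (I1 local steps, then I2 D-SGD steps)."""
    period = I1 + I2
    return {t for t in range(1, M * period + 1) if not (1 <= t % period <= I1)}


def scheme_two(T, I1, I2, M):
    """Communication steps when the local length is halved every M rounds."""
    steps, offset, cur = set(), 0, I1
    while cur >= 1 and offset < T:
        # Shift the current phase so that it starts right after the previous one.
        steps |= {t + offset for t in ancillary(cur, I2, M) if t + offset <= T}
        offset += M * (cur + I2)  # last step of this phase
        cur //= 2                 # assumed halving rule (floor division)
    # Local updates have vanished: every remaining step is a D-SGD step.
    steps |= set(range(offset + 1, T + 1))
    return steps


comm = scheme_two(T=20, I1=2, I2=1, M=2)
# Phase 0 (I1 = 2): steps 3 and 6; phase 1 (I1 = 1): steps 8 and 10; then 11..20.
```

The communication steps become denser over time, so the early phases save communication while the final phase coincides with plain D-SGD.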

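Theorem 4 bounds the $A_T$, $B_T$, and $C_T$ that correspond to LD-SGD equipped with $\mathcal{I}_T^2$. The proof is provided in Appendix C.

###### Theorem 4 (LD-SGD with $\mathcal{I}_T^2$).

When we set $\mathcal{I}_T = \mathcal{I}_T^2$ for LD-SGD, under the same setting, for $T \ge \max(\mathcal{J}_J)$, Theorem 1 holds with

$$\begin{aligned} A_T &\le \frac{1}{T} \frac{I_1}{1 - \rho^{2 I_2}} \rho^{2 (T - \max(\mathcal{J}_J))} + \Big(1 - \frac{\max(\mathcal{J}_J)}{T}\Big) \frac{\rho^2}{1 - \rho^2}, \\ B_T &\le K \left[ \frac{1}{T} \frac{I_1}{1 - \rho^{I_2}} \rho^{T - \max(\mathcal{J}_J)} + \Big(1 - \frac{\max(\mathcal{J}_J)}{T}\Big) \frac{\rho}{1 - \rho} \right], \\ C_T &\le K^2, \end{aligned}$$

where $K$ is the same as in Theorem 3. Therefore, LD-SGD converges with $\mathcal{I}_T^2$.

From the experiments in Section 7, this simple strategy empirically performs better than PD-SGD.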

## 6 Discussion

In this section, we will discuss some aspects of our main results (Theorem 1) and shed light on advantages of proposed update schemes.

Error decomposition. From Theorem 1, the upper bound (6) is decomposed into two parts. The first part is exactly the same as the optimization error bound in parallel SGD Bottou et al. (2018). The second part is termed the residual error, as it results from performing periodic local updates and reducing inter-node communication. In previous literature, the application of local updates inevitably results in a residual error Lan et al. (2017); Stich (2018); Wang and Joshi (2018b); Haddadpour et al. (2019); Li et al. (2019b); Yu et al. (2019a).

To look further into the residual error, take LD-SGD with $\mathcal{I}_T^1$ as an example. From Theorem 3, the residual error often grows with the length of local updates $I_1$. When data are independent and identically distributed (i.e., $\kappa = 0$), Wang and Joshi (2018b) shows that the residual error of the conventional PD-SGD grows only linearly in the length of local updates. Haddadpour et al. (2019) achieves a similar linear dependence but only requires each node to draw samples from its local partition. When data are not identically distributed (i.e., $\kappa$ is strictly positive), both Yu et al. (2019b) and Zhou and Cong (2017) show that the residual error of Local SGD grows quadratically in the length of local updates. Theorem 3 shows that the residual error of LD-SGD with $\mathcal{I}_T^1$ is $O(I_1 \sigma^2 + I_1^2 \kappa^2)$, where the linear dependence comes from the stochastic gradients and the quadratic dependence results from the heterogeneity. A similar dependence is also established for centralized momentum SGD in Yu et al. (2019a).

On Linear Speedup. Assume $\mathcal{I}_T$ satisfies

$$A_T = O(\sqrt{T}), \quad B_T = O(\sqrt{T}) \quad \text{and} \quad C_T = o(T). \tag{9}$$

Note that Condition (9) is sufficient for the sublinear condition (8). From Corollary 1, the convergence of LD-SGD with $\mathcal{I}_T$ will be dominated by the first term $O(1/\sqrt{nT})$ when the total number of steps $T$ is sufficiently large. So LD-SGD with $\mathcal{I}_T$ can achieve a linear speedup in terms of the number of worker nodes. Both $\mathcal{I}_T^1$ and $\mathcal{I}_T^2$ satisfy Condition (9). Taking LD-SGD with $\mathcal{I}_T^1$ as an example, we have

###### Corollary 2.

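In the setting of Theorem 3, if we set $\eta = \sqrt{n/T}$ and choose $I_1, I_2$ to satisfy $K \le T^{1/4} n^{-3/4}$, then the bound of $\frac{1}{T} \sum_{t=1}^{T} \mathbb{E} \|\nabla f(\bar{x}_t)\|^2$ becomes

$$\frac{2\Delta + L\sigma^2 + 4 L^2 (\sigma^2 + \kappa^2)}{\sqrt{nT}}.$$

However, if $\mathcal{I}_T$ fails to meet (9), the second term $4 n L^2 (A_T \sigma^2 + B_T \kappa^2)/T$ will dominate. As a result, more worker nodes may lead to slow convergence or even divergence. As suggested by Theorem 2, one way to make the first term dominate is to involve more communication.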

Communication Efficiency. LD-SGD with $\mathcal{I}_T$ needs only $|\mathcal{I}_T|$ communications per $T$ total steps. To increase communication efficiency, we are motivated to reduce the size of $\mathcal{I}_T$ as much as possible. However, as suggested by Theorem 2, to guarantee convergence, we are required to make sure $\mathcal{I}_T$ is sufficiently large (so that $\mathrm{gap}(\mathcal{I}_T)$ will be small enough). The trade-off between communication and convergence needs a careful design of update schemes. The two proposed update schemes have their own ways of balancing the trade-off.

For LD-SGD with $\mathcal{I}_T^1$, it only needs $\frac{I_2}{I} T$ communications per $T$ total steps, where $I = I_1 + I_2$. From Corollary 2, to ensure $K \le T^{1/4} n^{-3/4}$, we can take $I_1 = \big(T^{1/4} n^{-3/4} - \frac{\rho}{1 - \rho}\big)(1 - \rho^{I_2})$. Then the communication complexity of LD-SGD with $\mathcal{I}_T^1$ is

$$\frac{I_2}{I_2 + \big(T^{1/4} n^{-3/4} - \frac{\rho}{1 - \rho}\big)(1 - \rho^{I_2})}\, T = O(T^{3/4} n^{3/4}),$$

which is an increasing function of $I_2$. On the other hand, a large $I_2$ speeds up convergence (since all bounds in Theorem 3 are decreasing in $I_2$). Therefore, the introduction of $I_2$ allows more flexibility to balance the trade-off between communication and convergence. From experiments, a large $I_2$ often has higher communication efficiency. Besides, similar to Local SGD, which has $O(T^{3/4} n^{3/4})$ communication complexity in centralized settings Yu et al. (2019b), LD-SGD with $\mathcal{I}_T^1$ also achieves that level.

For LD-SGD with $\mathcal{I}_T^2$, the convergence rate is much faster since, from Theorem 4, the bounds of $A_T$ and $B_T$ are much smaller than those of $\mathcal{I}_T^1$. However, it needs more communications per $T$ total steps than $\mathcal{I}_T^1$. To balance the trade-off, the parameters of the scheme should be chosen reasonably. From the experiments in Section 7, $\mathcal{I}_T^2$ empirically performs better than the vanilla PD-SGD.

Effect of connectivity $\rho$. The connectivity is measured by $\rho$, the second largest absolute eigenvalue of $W$. The network connectivity $\rho$ has an impact on the convergence rate via $\rho_{s,t-1}$. Each update scheme corresponds to one way in which $\rho_{s,t-1}$ depends on $\rho$.

Generally speaking, good connectivity helps reduce residual errors and thus speeds up convergence. If the graph is nicely connected, in which case $\rho$ is close to zero, then the update in one node will be propagated to all the other nodes very soon, and the convergence is thereby fast. As a result, the bounds in Theorem 2 are much smaller due to $\rho \approx 0$. On the other hand, if the network connection is very sparse (in which case $\rho \approx 1$), it will greatly slow convergence. Take LD-SGD with $\mathcal{I}_T^1$ as an example. When $\rho \to 1$, the denominators $1 - \rho^2$ and $1 - \rho$ in Theorem 3 tend to zero, so the bound of $A_T$ and the bound of $B_T$ can both be extremely large. Therefore, it needs more steps to converge.

Effect of $\sigma$ and $\kappa$. Undoubtedly, the existence of $\sigma$ and $\kappa$ negatively affects convergence. The former is common in stochastic optimization due to the variance incurred by computing gradients. The latter results from the nature of non-identical data distributions. Non-iid data is very commonplace in Federated Learning. The fact that $A_T \le B_T$ shows that the impact of non-iid data is much stronger than that of stochastic gradients.

## 7 Experiments

Experiment setup. We evaluate LD-SGD with the two proposed update schemes ($\mathcal{I}_T^1$ and $\mathcal{I}_T^2$) using the CIFAR-10 dataset, which has ten classes of natural images. We set the number of worker nodes to and connect every node with 10 nodes. The connection graph is sparse, and the second largest eigenvalue is big: . To make the objective functions heterogeneous, we let each node contain samples randomly selected from two classes. We build a small convolutional neural network (CNN) by adding the following layers one by one: ReLU ReLU Softmax. There are totally trainable parameters. We choose the best learning rate from . We set and evaluate the averaged model every 10 global steps on the global loss (1). If the horizontal axis is labeled 'steps', we show the global training loss vs. total steps. Otherwise, for 'communication steps', each unit on the horizontal axis represents one communication. In this way, we can measure the communication efficiency, i.e., how the global training loss decreases when a communication step (i.e., Step 2 and Step 3) is performed.

Convergence against computation. Figure 4 shows the convergence speed of $\mathcal{I}_T^1$ under different configurations. When $I_1$ is fixed as 10, a larger $I_2$ leads to faster convergence in terms of computation. As an extreme, D-SGD converges fastest (but requires the most communication). What's more, a larger $I_2$ has a lower final error.

Convergence against communication. Figure 6 shows the communication efficiency of $\mathcal{I}_T^1$ under different configurations. For a fixed $I_2$, a big $I_1$ requires fewer communications (precisely, $\frac{I_2}{I_1 + I_2} T$ communications) and has a fast drop of the global loss (which we have also mentioned in Figure 2). However, as mentioned, a large $I_1$ unfortunately incurs a high level of final errors. We speculate that, at the beginning, the optimization parameters are far away from any stationary point, and more local updates accelerate the move towards it. Once the parameters are close enough to a good region (e.g., the neighborhood of stationary points), more local updates inevitably increase the residual errors and thus deteriorate the ultimate loss level. It is natural to combine the two regimes; this is our motivation for proposing a decay strategy that gradually decreases $I_1$.

Decaying $I_1$. The above empirical observation suggests using a big $I_1$ in the beginning and a small $I_1$ in the end. We decay $I_1$ by half every 200 rounds of communication, i.e., about 2000 steps initially. As argued, the decaying strategy helps reduce the final error. From Figure 6, once we periodically reduce $I_1$, all choices of $I_1$ have a similar level of final errors. Figure 6 shows that LD-SGD with the decay strategy is the most efficient method. In practice, we can set the initial $I_1$ a little higher to take advantage of fast convergence.

## 8 Conclusion

In this paper we have proposed a meta algorithm LD-SGD, which can be combined with any update scheme $\mathcal{I}_T$. Our analytical framework has shown that with any $\mathcal{I}_T$ that has sufficiently small $\mathrm{gap}(\mathcal{I}_T)$, LD-SGD converges under the setting of stochastic non-convex optimization and non-identically distributed data. We have also presented two novel update schemes, one adding multiple D-SGDs (denoted by $\mathcal{I}_T^1$) and the other gradually reducing the length of local updates (denoted by $\mathcal{I}_T^2$). Both schemes help improve communication efficiency theoretically and empirically. The framework we proposed might help researchers design more efficient update schemes.

Appendix

## Appendix A Proof of Main Result

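### A.1 Additional notation

In the proofs we will use the following notation. Let $G(X; \xi)$ be defined as in Section 3. Let

$$\bar{g}(X; \xi) := \frac{1}{n} G(X; \xi) \mathbf{1}_n = \frac{1}{n} \sum_{k=1}^{n} \nabla F_k(x^{(k)}; \xi^{(k)}) \in \mathbb{R}^d$$

be the averaged gradient. Recall from (1) the definition $f_k(x) = \mathbb{E}_{\xi \sim \mathcal{D}_k}[F_k(x; \xi)]$. We analogously define

$$\begin{aligned} \nabla f(X) &:= \mathbb{E}[G(X; \xi)] = [\nabla f_1(x^{(1)}), \cdots, \nabla f_n(x^{(n)})] \in \mathbb{R}^{d \times n}, \\ \overline{\nabla f}(X) &:= \mathbb{E}[\bar{g}(X; \xi)] = \frac{1}{n} \nabla f(X) \mathbf{1}_n = \frac{1}{n} \sum_{k=1}^{n} \nabla f_k(x^{(k)}) \in \mathbb{R}^d, \\ \nabla f(\bar{x}) &:= \overline{\nabla f}(\bar{x} \mathbf{1}_n^\top) = \frac{1}{n} \sum_{k=1}^{n} \nabla f_k(\bar{x}) \in \mathbb{R}^d. \end{aligned}$$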

Let and . Define the residual error as

 (10)

where the expectation is taken with respect to all randomness of stochastic gradients or equivalently where . Except where noted, we will use notation in stead of for simplicity. Hence .

As mentioned in Section 4.1, LD-SGD with an arbitrary update scheme $\mathcal{I}_T$ can be equivalently written in matrix form, which will be used repeatedly in the following convergence analysis. Specifically,

$$X_{t+1} = (X_t - \eta\, G(X_t; \xi_t))\, W_t, \tag{11}$$

where $X_t \in \mathbb{R}^{d \times n}$ is the concatenation of $\{x_t^{(k)}\}_{k=1}^{n}$, $G(X_t; \xi_t)$ is the concatenated gradient evaluated at $X_t$ with the sampled datum $\xi_t$, and $W_t$ is the connected matrix defined by

$$W_t = \begin{cases} I_n & \text{if } t \notin \mathcal{I}_T; \\ W & \text{if } t \in \mathcal{I}_T. \end{cases} \tag{12}$$

### A.2 Useful lemmas

The main idea of the proof is to express the iterates in terms of gradients and then develop an upper bound on the residual errors. The technique to bound residual errors can be found in Wang and Joshi [2018a, b]; Wang et al. [2019]; Yu et al. [2019b, a].

###### Lemma 1 (One step recursion).

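Let Assumptions 1 and 2 hold and $L$ and $\sigma$ be defined therein. Let $\eta$ be the learning rate. Then the iterate obtained from the update rule (11) satisfies

$$\mathbb{E}[f(\bar{x}_{t+1})] \le \mathbb{E}[f(\bar{x}_t)] - \frac{\eta}{2} \mathbb{E} \|\nabla f(\bar{x}_t)\|^2 - \frac{\eta}{2} (1 - \eta L)\, \mathbb{E} \|\overline{\nabla f}(X_t)\|^2 + \frac{L \sigma^2 \eta^2}{2n} + \frac{\eta L^2}{2} V_t, \tag{13}$$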

where the expectations are taken with respect to all randomness in stochastic gradients.

###### Proof.

Recall that from the update rule (11) we have

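$$\bar{x}_{t+1} = \bar{x}_t - \eta\, \bar{g}(X_t; \xi_t).$$

When Assumptions 1 and 2 hold, it follows directly from Lemma 8 in Tang et al. [2018b] that

$$\begin{aligned} \mathbb{E}[f(\bar{x}_{t+1})] \le{} & \mathbb{E}[f(\bar{x}_t)] - \frac{\eta}{2} \mathbb{E} \|\nabla f(\bar{x}_t)\|^2 - \frac{\eta}{2} (1 - \eta L)\, \mathbb{E} \|\overline{\nabla f}(X_t)\|^2 + \frac{L \sigma^2 \eta^2}{2n} \\ & + \frac{\eta}{2} \mathbb{E} \|\nabla f(\bar{x}_t) - \overline{\nabla f}(X_t)\|^2. \end{aligned}$$

The conclusion then follows from

$$\begin{aligned} \mathbb{E} \|\nabla f(\bar{x}_t) - \overline{\nabla f}(X_t)\|^2 &= \frac{1}{n^2} \mathbb{E} \Big\| \sum_{k=1}^{n} \big[ \nabla f_k(\bar{x}_t) - \nabla f_k(x_t^{(k)}) \big] \Big\|^2 \\ &\overset{(a)}{\le} \frac{1}{n} \sum_{k=1}^{n} \mathbb{E} \big\| \nabla f_k(\bar{x}_t) - \nabla f_k(x_t^{(k)}) \big\|^2 \\ &\overset{(b)}{\le} \frac{L^2}{n} \sum_{k=1}^{n} \mathbb{E} \big\| \bar{x}_t - x_t^{(k)} \big\|^2 = L^2 V_t, \end{aligned}$$

where (a) follows from Jensen's inequality, (b) follows from Assumption 1, and $V_t$ is defined in (10). ∎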

###### Lemma 2 (Residual error decomposition).

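Let $X_1 = x_1 \mathbf{1}_n^\top$ be the initialization (all nodes start from the same point $x_1$). If we apply the update rule (11), then for any $t \ge 2$,

$$X_t (I_n - Q) = -\eta \sum_{s=1}^{t-1} G(X_s; \xi_s) (\boldsymbol{\Phi}_{s,t-1} - Q), \tag{14}$$

where $\boldsymbol{\Phi}_{s,t-1}$ is defined in (15) and $W_t$ is given in (12).

$$\boldsymbol{\Phi}_{s,t-1} = \begin{cases} I_n & \text{if } s \ge t, \\ \prod_{l=s}^{t-1} W_l & \text{if } s < t. \end{cases} \tag{15}$$

###### Proof.

For convenience, we denote by $G_t = G(X_t; \xi_t)$ the concatenation of stochastic gradients at iteration $t$. According to the update rule, we have

$$\begin{aligned} X_t (I_n - Q) &= (X_{t-1} - \eta G_{t-1}) W_{t-1} (I_n - Q) \\ &\overset{(a)}{=} X_{t-1} (I_n - Q) W_{t-1} - \eta G_{t-1} (W_{t-1} - Q) \\ &\overset{(b)}{=} X_{t-l} (I_n - Q) \prod_{s=t-l}^{t-1} W_s - \eta \sum_{s=t-l}^{t-1} G_s (\boldsymbol{\Phi}_{s,t-1} - Q) \\ &\overset{(c)}{=} X_1 (I_n - Q) \boldsymbol{\Phi}_{1,t-1} - \eta \sum_{s=1}^{t-1} G_s (\boldsymbol{\Phi}_{s,t-1} - Q), \end{aligned}$$

where (a) follows from $W_{t-1} Q = Q W_{t-1} = Q$; (b) results by iteratively expanding the expression of $X_s$ from $s = t-1$ to $s = t-l+1$ and plugging in the definition of $\boldsymbol{\Phi}_{s,t-1}$ in (15), with $W_l$ given in (12); (c) follows simply by setting $l = t-1$. Finally, the conclusion follows from the initialization $X_1 = x_1 \mathbf{1}_n^\top$, which implies $X_1 (I_n - Q) = 0$. ∎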

###### Lemma 3 (Gradient variance decomposition).

Given any sequence of deterministic matrices