
# Local Averaging Helps: Hierarchical Federated Learning and Convergence Analysis

## Abstract

Federated learning is an effective approach to realize collaborative learning among edge devices without exchanging raw data. In practice, these devices may connect to local hubs which are then connected to the global server (aggregator). Due to the (possibly limited) computation capability of these local hubs, it is reasonable to assume that they can perform simple averaging operations. A natural question is whether such local averaging is beneficial under different system parameters and how much gain can be obtained compared to the case without such averaging. In this paper, we study hierarchical federated learning with stochastic gradient descent (HF-SGD) and conduct a thorough theoretical analysis to analyze its convergence behavior. The analysis demonstrates the impact of local averaging precisely as a function of system parameters. Due to the higher communication cost of global averaging, a strategy of decreasing the global averaging frequency and increasing the local averaging frequency is proposed. Experiments validate the proposed theoretical analysis and the advantages of hierarchical federated learning.


Preprint. Work in progress.

## 1 Introduction

Federated learning (FL) is an emerging distributed machine learning technique that enables learning at a large scale from decentralized data generated by client devices [19, 11, 8, 14, 21, 23]. In particular, it has been proposed to be applied to wireless edge and mobile networks, where a massive number of devices collaborate together to perform a learning task without exchanging raw data [14, 13, 7, 20]. Compared to conventional machine learning techniques where data need to be transmitted to and stored at a central location, the benefits of FL include improved data privacy and reduced communication and storage requirements.

In a typical FL system, a server in the cloud acts as a parameter aggregator that receives iterative updates from clients distributed over a large geographical area. After receiving the updates, the server aggregates (averages) the parameters and sends the result back to clients for the next iteration. The challenge with this approach is that it includes a lot of communication over wide-area network (WAN), which is prone to unpredictable latency and packet loss especially in areas with poor network connections. This can cause poorly connected areas to be unfairly treated in the model training process, so that the FL result is in favor of data originated from well-connected clients.

To resolve this problem, we present a hierarchical FL technique in this paper, where two “levels” of parameter aggregation functionalities are distributed across the networked system, as shown in Figure 1. In particular, local averaging can be performed at a networked component that is in close proximity to a certain group of clients. As local communication bandwidth is usually much higher than WAN bandwidth, this approach can alleviate network congestion, increase the communication robustness of clients, and also reduce the overall ingress traffic at the server. From a system perspective, recent developments of edge computing and network function virtualization (NFV) provide computational capability to various components (e.g., edge nodes, gateways, routers) in future communication networks [1]. Local parameter averaging can be performed on such networked components.

To support hierarchical FL with local averaging, a fundamental question is: how does local averaging affect FL convergence and what is the impact of data distribution at FL clients on this process? In this paper, we address this question by conducting a thorough theoretical analysis with empirical validations. Our main contributions are summarized as follows.

• We present a hierarchical FL method based on stochastic gradient descent (SGD) (referred to as HF-SGD) that leverages local averaging within groups of clients.

• We theoretically analyze the convergence bound of HF-SGD for both fixed client grouping and uniformly random client grouping, with non-independent/non-identically distributed (non-IID) data across the clients and groups. This is the first work to analyze the effect of client grouping by considering both local and global divergences in HF-SGD. Our analytical bound clearly demonstrates the advantage of HF-SGD over single-level federated SGD (SF-SGD).

• Our theoretical results lead to useful insights such as how local and global averaging periods (intervals) and data heterogeneity affect the convergence of HF-SGD. These are validated with experiments on real datasets. Our experiments show that HF-SGD brings the benefit of accelerated learning convergence while reducing the communication cost of global averaging.

## 2 Related Work

FL was first proposed by McMahan et al. [19] as a federated averaging (FedAvg) algorithm. It was later shown that FedAvg converges with non-IID client data [24, 15]. The non-IID data distribution at clients is one of the major challenges in FL, which is also known as data heterogeneity.

To improve the communication efficiency of FL, methods for adapting the parameter averaging period (interval) for efficient learning were proposed [23, 22]. Model/gradient sparsification and compression techniques were also applied to FL, which reduces the amount of information exchanged between clients and the server while maintaining a similar model accuracy [11, 9, 10, 6]. In addition, partial client participation can further reduce the computation and communication overhead [2], which, however, still requires the participating clients to communicate with the server directly and the convergence rate is inversely proportional to the number of participating clients [15]. These techniques are orthogonal to our work and may be applied together with our proposed hierarchical FL.

By moving parameter aggregation from the server down to the clients, decentralized SGD was proposed by Lian et al. [16, 17]. However, it does not address the common FL scenario where an aggregation is only performed once after multiple local iterations for communication efficiency, and does not support a hierarchical setup with aggregation at local hubs either. Some recent works have considered similar settings as hierarchical FL. Chen et al. [5] studied sequential updates from each group, which is different from our HF-SGD method that allows parallel updates with two levels of (local and global) aggregation. Independent and identically distributed (IID) data was considered in the work by Castiglia et al. [4], which does not capture the impact of data heterogeneity. The result by Liu et al. [18] has a bound that is exponential in the parameter aggregation period (interval) and can become loose if the period is large.

In this paper, different from the above existing works, we derive the convergence bound of hierarchical FL (more specifically, HF-SGD) that captures the impact of data heterogeneity within and across groups of clients, based on which useful insights are observed. In addition, our bound is tight and degenerates to the best known results for SF-SGD and non-federated (i.e., centralized) SGD. Our analysis is challenging due to the complex interplay between local aggregation (within a group) and global aggregation (across groups), data heterogeneity, and convergence.

In the remainder of this paper, Section 3 describes the overall procedure of HF-SGD and Section 4 gives its convergence analysis. Strategies for reducing communication cost are discussed in Section 5. The experimental results are presented in Section 6. Finally, Section 7 concludes the paper.

## 3 Hierarchical Federated Stochastic Gradient Descent (HF-SGD)

In classical FedAvg [19], a set of clients collaboratively minimize the overall empirical risk in an iterative manner. After a number of local SGD iterations (referred to as a local period) computed on each client’s local data, the resulting model parameters are sent to the server, which performs an averaging operation to obtain an intermediate global model with parameter vector $\bar{w}$. Then, $\bar{w}$ is sent to all clients, and the FL process continues until convergence. As mentioned earlier, we refer to this classical FedAvg method as SF-SGD, to differentiate it from our HF-SGD technique.

In our hierarchical FL system (see Figure 1), all clients are partitioned into $N$ groups $V_1, \ldots, V_N$. The number of clients in each group is denoted by $n_i$ ($i = 1, \ldots, N$). The grouping can be a natural result of clients connecting to different local hubs, where those clients that are connected to a common hub can be considered as one group. In this case, grouping is fixed based on the network topology. In another example, grouping can be based on the dynamic configuration of parameter aggregation functions and routing in an NFV-supported system, which is largely configurable at a logical level and allows strategies such as random grouping. In this paper, our main goal is to characterize the impact of grouping and show its advantages. For a given grouping, we can write the objective function as

$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i} F_j(w) \tag{1}$$

where $w$ is the model parameter vector and $F_j(w)$ is the empirical risk function defined over the local dataset at client $j$, i.e., the average loss over that client's data samples. In this paper, we consider a general objective that is non-convex. Then, for each group $V_i$, we define the intra-group objective function as

$$f_i(w) := \frac{1}{n_i}\sum_{j \in V_i} F_j(w). \tag{2}$$

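During each local iteration $t$, each client $j$ updates its own model using SGD:

$$w_j^{t+1} = w_j^t - \gamma\, g(w_j^t, \zeta_j^t) \tag{3}$$

where $\gamma$ is the learning rate, $g(w_j^t, \zeta_j^t)$ is the stochastic gradient of $F_j(w_j^t)$, and $\zeta_j^t$ represents random data samples from the local dataset at client $j$. We assume that the stochastic gradient is unbiased, i.e., $\mathbb{E}[g(w_j^t, \zeta_j^t) \mid w_j^t] = \nabla F_j(w_j^t)$.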

Instead of aggregating all local models at the global aggregator (server) at the end of a local period as in SF-SGD, models are first averaged within group $V_i$ ($i = 1, \ldots, N$) after every $I_i$ local iterations. In particular, at every local iteration $t$ that is a multiple of $I_i$, we compute the group average $\bar{w}_i^t := \frac{1}{n_i}\sum_{j \in V_i} w_j^t$. This can be done at a local aggregator (e.g., a computational component in close proximity) for group $V_i$. After several rounds of “intra-group” averaging, models from the $N$ groups are averaged globally. Let a global aggregation be performed every $G$ local iterations. Then, at every local iteration $t$ that is a multiple of $G$, we have $w_j^t = \bar{w}^t := \frac{1}{N}\sum_{i=1}^{N}\bar{w}_i^t$ for all clients $j$. Note that we assume that all the clients perform synchronous updates and we let $G$ be a common multiple of $I_1, \ldots, I_N$. Therefore, FL is conducted both at the local level within a group and at the global level across groups. This design is referred to as HF-SGD, which is summarized in Algorithm 1, where $a \mid b$ (or $a \nmid b$) denotes that $a$ divides (or does not divide) $b$, i.e., $b$ is (or is not) an integer multiple of $a$.
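The two-level procedure described above can be sketched on a toy problem. This is a minimal illustration, not the paper's Algorithm 1: the quadratic client objectives $F_j(w) = \frac{1}{2}\|w - c_j\|^2$, the Gaussian gradient noise, and the function name `hf_sgd` are all assumptions made for the example.

```python
import numpy as np

def hf_sgd(targets, groups, I, G, gamma=0.1, T=200, noise=0.1, seed=0):
    """Toy HF-SGD where client j minimizes F_j(w) = 0.5 * ||w - targets[j]||^2.

    groups : list of lists of client indices (the sets V_i)
    I      : list of local periods I_i, one per group
    G      : global period, a common multiple of all I_i
    """
    assert all(G % I_i == 0 for I_i in I), "G must be a common multiple of the I_i"
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets, dtype=float)
    w = np.zeros_like(targets)  # row j holds client j's model w_j
    for t in range(1, T + 1):
        # Local SGD step (3): unbiased noisy gradient of F_j at w_j.
        grad = (w - targets) + noise * rng.standard_normal(w.shape)
        w = w - gamma * grad
        if t % G == 0:
            # Global aggregation: average of the group averages.
            group_means = np.stack([w[V].mean(axis=0) for V in groups])
            w[:] = group_means.mean(axis=0)
        else:
            # Local aggregation for every group whose period divides t.
            for V, I_i in zip(groups, I):
                if t % I_i == 0:
                    w[V] = w[V].mean(axis=0)
    return w
```

With the noise switched off, the average of group averages contracts geometrically towards the minimizer of (1) for this toy objective, which is the mean of the per-group mean targets; both kinds of averaging leave that quantity unchanged.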

## 4 Convergence Analysis

We begin with a description of the assumptions made in our convergence analysis. Then, we present our results for HF-SGD with fixed client groupings. These will be followed by our results for HF-SGD with random grouping. The analysis presented in this work characterizes the convergence behavior of HF-SGD and clearly demonstrates its advantage over SF-SGD.

### 4.1 Assumptions

###### Assumption 1.

In this paper, we make the following assumptions.

a) Lipschitz gradient

$$\|\nabla F_j(w) - \nabla F_j(w')\| \le L\|w - w'\|, \quad \forall j, w, w' \tag{4}$$

where $L$ is some positive constant.

b) Bounded variance

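$$\mathbb{E}_{\zeta_j \sim \mathcal{D}_j}\|g(w, \zeta_j) - \nabla F_j(w)\|^2 \le \sigma^2, \quad \forall j, w \tag{5}$$

where $\mathcal{D}_j$ is the dataset at client $j$.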

c) Bounded global divergence (HF-SGD)

$$\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(w) - \nabla f(w)\|^2 \le \epsilon^2, \quad \forall w \tag{6}$$

d) Bounded local divergence (HF-SGD)

$$\frac{1}{n_i}\sum_{j \in V_i}\|\nabla F_j(w) - \nabla f_i(w)\|^2 \le \epsilon_i^2, \quad \forall i, w \tag{7}$$

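Note that the Lipschitz gradient assumption also applies to the group objective $f_i(w)$ and the overall objective $f(w)$. For example, $\|\nabla f_i(w) - \nabla f_i(w')\| \le \frac{1}{n_i}\sum_{j \in V_i}\|\nabla F_j(w) - \nabla F_j(w')\| \le L\|w - w'\|$.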

For the special case of $N = 1$, all clients form a single group and HF-SGD reduces to SF-SGD. In this case, we can rewrite (7) as follows.

###### Assumption 2.

Bounded divergence (SF-SGD)

$$\frac{1}{n}\sum_{j \in V}\|\nabla F_j(w) - \nabla f(w)\|^2 \le \tilde{\epsilon}^2, \quad \forall w. \tag{8}$$

To characterize the impact of client grouping in HF-SGD, instead of assuming (8), we assume (6) and (7) for bounded global and local divergences, respectively. In particular, the global divergence corresponds to the relationship between the gradients of intra-group objectives (defined on clients’ data within each group) and the gradient of the overall objective (defined on all clients’ data). A larger global divergence means a more heterogeneous data distribution between groups. Similarly, the local divergence characterizes the data heterogeneity of clients within each group $V_i$.
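To make the three divergence notions concrete, the sketch below evaluates the left-hand sides of (6), (7), and (8) at a single point $w$, given the client gradients at that point. The function name `divergences` and the example gradients are illustrative assumptions; the single-level quantity uses $\nabla f(w)$ as defined through (1), which coincides with the plain average over clients only when all groups have the same size.

```python
import numpy as np

def divergences(grads, groups):
    """Return (global, local_per_group, single_level) divergences at one point w.

    grads  : array of shape (n, d); row j is the gradient of F_j at w
    groups : list of lists of client indices (the sets V_i)
    """
    grads = np.asarray(grads, dtype=float)
    group_grads = np.stack([grads[V].mean(axis=0) for V in groups])  # grad f_i(w)
    overall = group_grads.mean(axis=0)                               # grad f(w)
    # Left-hand side of (6): spread of group gradients around the overall gradient.
    global_div = np.mean(np.sum((group_grads - overall) ** 2, axis=1))
    # Left-hand side of (7): spread of client gradients inside each group.
    local_div = [np.mean(np.sum((grads[V] - g_i) ** 2, axis=1))
                 for V, g_i in zip(groups, group_grads)]
    # Left-hand side of (8): spread of all client gradients (single group).
    single = np.mean(np.sum((grads - overall) ** 2, axis=1))
    return global_div, local_div, single
```

For equal-sized groups, the single-level divergence at a point splits exactly into the global divergence plus the average local divergence, which is the decomposition that the grouping analysis later exploits.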

### 4.2 HF-SGD with Fixed Client Grouping

In the following, we will present our main convergence results when the client grouping is fixed.

###### Theorem 1.

Consider the problem in (1). For any fixed client grouping that satisfies Assumption 1, if the learning rate in Algorithm 1 satisfies , then for any , we have

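$$
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{w}^t)\|^2 \le{}& \frac{2(f_0 - f_*)}{\gamma T} + \frac{\gamma L \sigma^2}{N^2}\sum_{i=1}^{N}\frac{1}{n_i} && \text{(9a)}\\
&+ \frac{4\gamma^2 L^2\left[\left(1 - \frac{1}{N}\right)\left(\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\right)2G\sigma^2 + 3G^2\epsilon^2\right]}{1 - 12\gamma^2 G^2 L^2} && \text{(9b)}\\
&+ \left(1 + \frac{8\gamma^2 G^2 L^2}{1 - 12\gamma^2 G^2 L^2}\right)\cdot\frac{4\gamma^2 L^2}{1 - 12\gamma^2 I_{\max}^2 L^2}\cdot\frac{1}{N}\sum_{i=1}^{N}\left[\left(1 - \frac{1}{n_i}\right)I_i\sigma^2 + 3I_i^2\epsilon_i^2\right] && \text{(9c)}
\end{aligned}
$$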

where .

All missing proofs in this paper can be found in the appendix.

Note that the bound (9a)–(9c) can be partitioned into three parts. The two terms in (9a) correspond to the impact of non-federated SGD, which means that if we let , then according to their definitions and we have . This is the same as the non-federated SGD result by Bottou et al. [3]. It shows that our bound reduces to the non-federated SGD case when . The term in (9b) shows the impact of global (cross-group) averaging. The term in (9c) mainly corresponds to the impact of local (intra-group) averaging. Note that also appears in (9c). It can be interpreted as follows. When s are fixed and increases, the bound in (9c) becomes larger since a greater reduces the global averaging frequency.

In (9b) and (9c), it can be seen that the coefficients in front of the SGD noise $\sigma^2$ involve $G$ and $I_i$, while the coefficients in front of the divergences $\epsilon^2$ and $\epsilon_i^2$ involve $G^2$ and $I_i^2$. This means that the divergences may play a more important role than $\sigma^2$ in the convergence rate. One implication from this observation is that if $G$ is fixed, then it is desirable to choose a smaller $I_i$ for a bigger $\epsilon_i$, and a larger $I_i$ for a smaller $\epsilon_i$ (in a way that the bound (9a)–(9c) remains unchanged), which follows the intuition.

When , we can obtain the following corollary directly from Theorem 1, which gives a slightly looser but simpler upper bound of .

###### Corollary 1.

Consider the problem in (1). For any fixed client grouping that satisfies Assumption 1, if the learning rate in Algorithm 1 satisfies , then for any , we have

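$$
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{w}^t)\|^2 \le{}& \frac{2(f_0 - f_*)}{\gamma T} + \gamma L\sigma^2\frac{1}{N^2}\sum_{i=1}^{N}\frac{1}{n_i} + 2C\gamma^2 G L^2\left(1 - \frac{1}{N}\right)\left(\sigma^2\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\right) + 3C\gamma^2 G^2 L^2\epsilon^2\\
&+ \frac{2C\gamma^2 L^2\sigma^2}{N}\sum_{i=1}^{N}\left(1 - \frac{1}{n_i}\right)I_i + \frac{3C\gamma^2 L^2}{N}\sum_{i=1}^{N}I_i^2\epsilon_i^2,
\end{aligned} \tag{10}
$$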

where .

Remark. From Corollary 1, let with , when is sufficiently large, it can be seen that

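$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{w}^t)\|^2 = O\left(\sqrt{\frac{\frac{1}{N^2}\sum_{i=1}^{N}\frac{1}{n_i}}{T}}\right) + O\left(\frac{1}{T}\right). \tag{11}$$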

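If $n_i = n/N$ for all $i$ (equal number of clients per group), then we can obtain

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{w}^t)\|^2 = O\left(\frac{1}{\sqrt{nT}}\right) + O\left(\frac{1}{T}\right), \tag{12}$$

which achieves a linear speedup compared to the case of non-federated SGD. This aligns with the observation by Yu et al. [24].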

In the following corollary, we will show that (10) of Corollary 1 can be reduced to the case of SF-SGD. We use $P$ to denote the aggregation period of SF-SGD.

###### Corollary 2.

(Degenerate to SF-SGD case) Let and , from (10), we obtain

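$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{w}^t)\|^2 \le \frac{2(f_0 - f_*)}{\gamma T} + \frac{\gamma L\sigma^2}{n} + 2C\gamma^2 L^2\sigma^2\left(1 - \frac{1}{n}\right)P + 3C\gamma^2 L^2 P^2\tilde{\epsilon}^2 \tag{13}$$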

The proof of Corollary 2 is straightforward by letting , , and in (10) and using (8).
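As a numerical illustration of this degeneration, the sketch below evaluates the right-hand sides of (10) and (13) as plain functions. The function names are hypothetical, and the constant `C` is passed in as a parameter because its definition is not reproduced here; the value used in any call is arbitrary. The single-group setting is one reading of the reduction: one group containing all $n$ clients, local period equal to $P$, zero global divergence (since $f_1 = f$), and local divergence equal to $\tilde{\epsilon}^2$.

```python
def corollary1_bound(f_gap, gamma, T, L, sigma2, C, G, n_sizes, I, eps2, eps2_local):
    """Right-hand side of (10).

    f_gap      : f_0 - f_*
    n_sizes[i] : n_i,  I[i] : I_i,  eps2_local[i] : eps_i^2,  eps2 : eps^2
    C          : constant from Corollary 1 (treated as an input here)
    """
    N = len(n_sizes)
    inv_n = sum(1.0 / n_i for n_i in n_sizes)  # sum_i 1 / n_i
    sgd = 2.0 * f_gap / (gamma * T) + gamma * L * sigma2 * inv_n / N**2
    glob = (2 * C * gamma**2 * G * L**2 * (1 - 1 / N) * sigma2 * inv_n / N
            + 3 * C * gamma**2 * G**2 * L**2 * eps2)
    loc = (2 * C * gamma**2 * L**2 * sigma2 / N
           * sum((1 - 1 / n_i) * I_i for n_i, I_i in zip(n_sizes, I))
           + 3 * C * gamma**2 * L**2 / N
           * sum(I_i**2 * e for I_i, e in zip(I, eps2_local)))
    return sgd + glob + loc

def sf_sgd_bound(f_gap, gamma, T, L, sigma2, C, P, n, eps2_tilde):
    """Right-hand side of (13)."""
    return (2.0 * f_gap / (gamma * T) + gamma * L * sigma2 / n
            + 2 * C * gamma**2 * L**2 * sigma2 * (1 - 1 / n) * P
            + 3 * C * gamma**2 * L**2 * P**2 * eps2_tilde)
```

Calling `corollary1_bound` with `n_sizes=[n]`, `I=[P]`, `eps2=0.0`, and `eps2_local=[eps2_tilde]` returns the same value as `sf_sgd_bound`, because the factor $(1 - \frac{1}{N})$ vanishes for $N = 1$.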

While (13) is similar to the bound by Yu et al. [24], we note that the third term in (13) has an additional factor of $(1 - \frac{1}{n})$ compared to the bound in Yu et al. [24]. This factor can potentially make our bound tighter. Another important observation is that the techniques used to obtain this factor are the key to making our HF-SGD bound (10) degenerate to the SF-SGD and non-federated SGD cases.

### 4.3 HF-SGD with Random Grouping and Comparison to SF-SGD

Built upon the results of Theorem 1 for a fixed client grouping, we further analyze the convergence performance of HF-SGD under the assumption of random grouping. The main technique used here is to first characterize the local and global divergence of HF-SGD under random grouping (see Lemmas 1 and 2 below) and relate them to the divergence of SF-SGD defined in (8). An upper bound on the averaged performance of HF-SGD is presented in Theorem 2 below, which demonstrates the advantage of HF-SGD over SF-SGD under random grouping.

For client grouping, we consider all possible grouping strategies with the constraint that $n_i = n/N$ ($i = 1, \ldots, N$). Then, we uniformly select one grouping strategy at random. Let the random variable $S$ denote the uniformly random grouping strategy. This means that each realization of $S$ corresponds to one grouping realization. Based on the uniformly random grouping strategy $S$, in the following, we will show that under (8) in Assumption 2, we can explicitly compute the average global and local divergences, which can be considered as counterparts of (6) and (7) in Assumption 1, as a function of $N$, $n$, and $\tilde{\epsilon}$.

Consider a random client in group , we can obtain that , where the expectation is over all possible grouping strategies. These lead to the following two lemmas.

###### Lemma 1.

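Using the uniformly random grouping strategy $S$, the average global divergence is given by

$$\mathbb{E}_S\left[\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla f(w) - \frac{N}{n}\sum_{k \in V_i}\nabla F_k(w)\right\|^2\right] \le \left(\frac{N-1}{n-1}\right)\tilde{\epsilon}^2, \tag{14}$$

where $\tilde{\epsilon}$ is given in (8).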

###### Lemma 2.

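Using the uniformly random grouping strategy $S$, the average local divergence is given by

$$\mathbb{E}_S\left[\frac{N}{n}\sum_{k \in V_i}\left\|\nabla F_k(w) - \frac{N}{n}\sum_{k' \in V_i}\nabla F_{k'}(w)\right\|^2\right] \le \left(1 - \frac{N-1}{n-1}\right)\tilde{\epsilon}^2, \tag{15}$$

where $\tilde{\epsilon}$ is given in (8).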

Lemmas 1 and 2 show an interesting behavior of grouping. If we add the upper bounds of the average global divergence (14) and the average local divergence (15), the sum equals $\tilde{\epsilon}^2$, which is independent of other system parameters. In Theorem 2, we will show that grouping can effectively “break” $\tilde{\epsilon}^2$ into two parts, which are “modulated” by functions of $G$ and $I$, respectively.
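The two lemmas can be checked numerically on a small instance by enumerating every equal-size grouping. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the client gradients at a fixed $w$ are arbitrary vectors, and $\tilde{\epsilon}^2$ is replaced by the exact single-level divergence at that point rather than a uniform bound, so the inequalities in (14) and (15) are met with equality. The local divergence is averaged over groups, which by symmetry matches the per-group expectation in (15).

```python
import itertools
import numpy as np

def expected_divergences(grads, N):
    """Average global/local divergence over all equal-size groupings of n clients."""
    grads = np.asarray(grads, dtype=float)
    n = len(grads)
    m = n // N                      # clients per group, n_i = n / N
    overall = grads.mean(axis=0)    # grad f(w)
    glob, loc, count = 0.0, 0.0, 0
    # Every permutation, cut into N consecutive blocks, is one grouping; each
    # grouping is produced by the same number of permutations, so a plain
    # average over permutations is a uniform average over groupings.
    for perm in itertools.permutations(range(n)):
        blocks = [list(perm[i * m:(i + 1) * m]) for i in range(N)]
        means = np.stack([grads[b].mean(axis=0) for b in blocks])
        glob += np.mean(np.sum((means - overall) ** 2, axis=1))
        loc += np.mean([np.mean(np.sum((grads[b] - mu) ** 2, axis=1))
                        for b, mu in zip(blocks, means)])
        count += 1
    eps2_tilde = np.mean(np.sum((grads - overall) ** 2, axis=1))
    return glob / count, loc / count, eps2_tilde
```

The enumeration grows factorially in $n$, so this is only practical for a handful of clients; for larger $n$, sampling random permutations would give a Monte Carlo estimate of the same averages.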

For simplicity, we let each group have the same local period $I$. Then, we can obtain the following.

###### Theorem 2.

Using the uniformly random grouping strategy , let , then we have

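$$
\begin{aligned}
\mathbb{E}_S\left[\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar{w}^t)\|^2\right] \le{}& \frac{2(f_0 - f_*)}{\gamma T} + \frac{\gamma L\sigma^2}{n} + 2C\gamma^2 L^2\left[\left(\frac{N-1}{n}\right)G + \left(1 - \frac{N}{n}\right)I\right]\sigma^2\\
&+ 3C\gamma^2 L^2\left[\left(\frac{N-1}{n-1}\right)G^2 + \left(1 - \frac{N-1}{n-1}\right)I^2\right]\tilde{\epsilon}^2,
\end{aligned} \tag{16}
$$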

where .

The proof follows by substituting results of the expected divergence in Lemmas 1 and 2 to the convergence bound of Corollary 1. Without additional assumptions, the bound (16) can also degenerate to SF-SGD by setting and letting .

Remark. Theorem 2 can be interpreted as follows. First, (16) explicitly shows the impact of grouping on the convergence rate of HF-SGD. For example, in the last term in (16), as mentioned earlier, the grouping approach “breaks” the divergence into two parts, whose weights are determined by the choice of , , and . One implication is that since , it is preferable to have a smaller number of groups for a large . In addition, we note that the performance of HF-SGD can be sandwiched by the performance of two SF-SGDs. In order to see this, we consider the following three scenarios: 1) SF-SGD with averaging period , 2) SF-SGD with averaging period , and 3) HF-SGD with local averaging period and global averaging period . We let them all start with the same and choose learning rate such that . We can see that all three cases have the same first two terms in (16). However, for the third and fourth terms in (16), we have

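$$\left(1 - \frac{1}{n}\right)I \le \left(\frac{N}{n} - \frac{1}{n}\right)G + \left(1 - \frac{N}{n}\right)I \le \left(1 - \frac{1}{n}\right)G, \tag{17}$$

$$I^2 \le \left(\frac{N-1}{n-1}\right)G^2 + \left(1 - \frac{N-1}{n-1}\right)I^2 \le G^2. \tag{18}$$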

Eqs. (17) and (18) show that the performance of HF-SGD can be lower and upper bounded by that of SF-SGD with averaging periods of $I$ and $G$, respectively. The grouping approach can explicitly characterize how much the convergence bound moves towards the best case, i.e., SF-SGD with averaging period $I$. Note, however, that SF-SGD with averaging period $I$ incurs the highest global communication cost, so grouping can adjust the trade-off between convergence and communication cost.
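The sandwich property can be verified directly from the bracketed coefficients in (16). The sketch below is a small illustration with a hypothetical function name; it drops the common prefactors $2C\gamma^2L^2$ and $3C\gamma^2L^2$, which are identical across the three scenarios being compared.

```python
def theorem2_terms(n, N, G, I):
    """Bracketed coefficients of sigma^2 and eps_tilde^2 in (16).

    n : total number of clients, N : number of groups
    G : global period, I : common local period (I <= G)
    """
    noise = (N - 1) / n * G + (1 - N / n) * I
    div = (N - 1) / (n - 1) * G**2 + (1 - (N - 1) / (n - 1)) * I**2
    return noise, div

# Example: 12 clients, local period 2, global period 8, varying group count.
for N in (1, 2, 3, 4, 6, 12):
    noise, div = theorem2_terms(12, N, 8, 2)
    print(N, round(noise, 3), round(div, 3))
```

Both coefficients are weighted combinations of the $I$ and $G$ values, with the weight on $G$ growing with $N$: at $N = 1$ they match SF-SGD with period $I$, and at $N = n$ they match SF-SGD with period $G$.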

Although the discussion above is based on Theorem 2, which is for uniformly random grouping, our experiments in Section 6 show that this property of being lower and upper bounded by SF-SGD also holds for fixed grouping in both the IID and non-IID cases.

## 5 Extension to Reducing Communication Cost

We introduce two approaches to reduce communication cost based on our theoretical results above. Later, we will provide experimental results in Section 6.

Using more local aggregations. In FL settings, global averaging is often more expensive than local averaging, as explained in Section 1. In this case, we can reduce the communication cost by decreasing the frequency of global averaging (i.e., increasing ) while increasing the frequency of local averaging (i.e., decreasing ). Our theoretical analysis shows that this approach does not necessarily degrade the performance of HF-SGD. In some cases, e.g., if is sufficiently small, then the convergence performance of HF-SGD can actually be improved. To see this, consider the last two terms of (16). Suppose . For a non-trivial client grouping, i.e., , if we increase to and decrease to , then one can show that the bound (16) using and can be lower than or equal to that using and . A similar behavior can also be seen for fixed grouping, as shown in the experimental results in Section 6.

Adjusting grouping strategy. Recall the divergence-related terms in Corollary 1, i.e., $3C\gamma^2 G^2 L^2\epsilon^2 + \frac{3C\gamma^2 L^2}{N}\sum_{i=1}^{N}I_i^2\epsilon_i^2$. Since $G$ is typically greater than $I_i$, decreasing the global divergence is more effective than decreasing the local divergence in order to tighten the convergence upper bound. If the knowledge of data distribution of all clients is available, we can adjust the grouping strategy to achieve a smaller global divergence. For example, we can make the data from each group distributed like a subset sampled uniformly from the union of all clients’ datasets, which is referred to as “group-IID”. In this way, we can increase $G$, and hence, reduce the global communication cost without affecting the convergence performance of HF-SGD.

## 6 Experiments

In this section, we validate our theoretical results with experiments that compare the performance of HF-SGD with SF-SGD. We also conduct experiments using the two approaches in Section 5 to reduce communication cost. The implementation is based on PyTorch.

### 6.1 Setup

Dataset and models. In our experiments, we train VGG-11 for the image classification tasks over the CIFAR-10 dataset which has image classes [12], where we refer to the class labels as for simplicity. In all experiments, we set the learning rate as and the SGD mini-batch size as .

Data separation. We implement four cases of data distribution. In the IID case, data are uniformly partitioned into clients. In the non-IID case, there are clients and data on each client are from only one image class. For example, the -th client only has data from the -th class (). Then, we also consider the data distribution across groups, where each group has clients and the data among clients within each group is non-IID distributed. For the “group-IID” case, the clients in each group have labels , respectively. In the “group-non-IID” case, clients in the first group have labels , respectively, while clients in the second group have labels , respectively.

### 6.2 Results

In all plots, curves with are for SF-SGD and curves with , , and are for HF-SGD.

Figure 1(a) shows the results of the IID case. In this case, we separate clients into two groups, each group with 5 clients. It can be seen that grouping has limited impact on the convergence when data on clients are IID. This is consistent with our theoretical results in Theorem 2, since IID implies that is very small. So the gain from decreasing is small. Nonetheless, the performances of different schemes align with Theorem 2. To illustrate this, we list the number of local iterations when first achieving an accuracy in Table 1. We observe that the performances of HF-SGD with global period and local period are always between SF-SGDs with period and with period . Note in this case, HF-SGD with performs only slightly better than SF-SGD with . This is because we only have two groups and 10 clients in total. In Theorem 2, the coefficient before is only in this case, which may reduce the impact of .

Figure 1(b) shows the results of the non-IID scenario. For HF-SGD with , we partition 10 clients into two groups. In Group 1, there are only data with labels while Group 2 only contains data with labels . In this case, the benefit of grouping clients is significant. Comparing HF-SGD with , and SF-SGD with , we see that by performing 4 local aggregations before global aggregation, HF-SGD can have a much improved convergence performance over SF-SGD. This is because is relatively large in the non-IID case. According to Theorem 2, by decreasing the value of slightly, the convergence bound can be significantly reduced. In addition, we can also observe that the performance of HF-SGD with global period and local period is still between that of SF-SGD with and with . Moreover, in Figure 1(b), we plot a 5-group case where we partition the same set of clients into groups. Each group has clients. We see that the performance in this case is worse than the case with only two groups. This is because a larger will make the term larger so the coefficient for becomes larger.

In Figure 3, we show the effects of grouping with group-IID strategy. In the group-IID case, data in each group are from all 10 labels. In group-non-IID case, data in each group are only from labels (see Section 6.1). From (10) in Corollary 1, it can be seen that the bound depends on . Hence, reducing can significantly improve the convergence performance since is desired to be large in general. Figure 3 also shows that the group-IID case performs as well as the group-non-IID case after reducing by half.

In Figure 4, we show that decreasing while increasing can keep the performance while reducing communication cost. We see that HF-SGD with performs similarly to SF-SGD with . That is, by allowing more local aggregations, HF-SGD can reduce the number of global communication by while maintaining a similar performance to SF-SGD.

## 7 Conclusion

This paper has studied HF-SGD with two-level parameter aggregations. Based on this framework, we have successfully answered important questions on how local averaging affects FL convergence and what is the impact of data distribution at clients on the learning process. In particular, we have provided a thorough theoretical analysis on the convergence of HF-SGD over non-IID data, under both fixed and random client grouping. Explicit comparisons of the convergence rate have been established for HF-SGD and SF-SGD, based on a novel analysis of the local and global divergences. Our theoretical analysis provides valuable insights into the design of practical HF-SGD systems, including the choice of global and local averaging periods. Different grouping strategies are considered to best utilize the divergence measures to reduce communication costs while accelerating learning convergence. Future work could analyze the effect of partial client participation in hierarchical FL and extend to more than two levels of parameter aggregation.


## Appendix A Proofs for the Fixed Grouping Case

We present the proofs of Theorem 1 and Corollary 1 in this section. Throughout the proof, we use the following inequalities frequently:

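$$\left\|\frac{1}{M}\sum_{i=1}^{M}x_i\right\|^2 \le \frac{1}{M}\sum_{i=1}^{M}\|x_i\|^2, \tag{A.1}$$

$$\frac{1}{M}\sum_{i=1}^{M}\|x_i - \bar{x}\|^2 = \frac{1}{M}\sum_{i=1}^{M}\|x_i\|^2 - \|\bar{x}\|^2 \le \frac{1}{M}\sum_{i=1}^{M}\|x_i\|^2, \tag{A.2}$$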

where $\bar{x} := \frac{1}{M}\sum_{i=1}^{M}x_i$ and (A.1) is due to Jensen’s inequality.
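Both inequalities are easy to sanity-check numerically. The snippet below is a small illustration on randomly drawn vectors (the dimensions and seed are arbitrary choices); it computes each side of (A.1) and (A.2) so they can be compared.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))                 # M = 5 vectors in R^3
x_bar = x.mean(axis=0)                          # the average vector

mean_sq = np.mean(np.sum(x ** 2, axis=1))       # (1/M) * sum_i ||x_i||^2
norm_of_mean_sq = np.sum(x_bar ** 2)            # ||(1/M) * sum_i x_i||^2
centered = np.mean(np.sum((x - x_bar) ** 2, axis=1))  # (1/M) * sum_i ||x_i - x_bar||^2

# (A.1): the squared norm of the mean never exceeds the mean squared norm.
print(norm_of_mean_sq <= mean_sq)
# (A.2): the centered spread equals the gap between the two quantities above.
print(np.isclose(centered, mean_sq - norm_of_mean_sq))
```

The identity in (A.2) is the usual variance decomposition, which is why subtracting the mean can only reduce the average squared norm.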

For the ease of notations, we define for any .

### A.1 Proof of Theorem 1

Although the averaged global model $\bar{w}^t$ is not observable in the system at iterations $t$ with $G \nmid t$, here we use $\bar{w}^t$ for analysis.

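$$
\begin{aligned}
\mathbb{E}f(\bar{w}^{t+1}) &= \mathbb{E}f\left(\bar{w}^t - \gamma\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}g_j(w_j^t)\right)\\
&\overset{(a)}{\le} \mathbb{E}f(\bar{w}^t) - \gamma\,\mathbb{E}\left\langle\nabla f(\bar{w}^t), \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}g_j(w_j^t)\right\rangle + \frac{\gamma^2 L}{2}\mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}g_j(w_j^t)\right\|^2,
\end{aligned} \tag{A.3}
$$

where (a) follows from the Lipschitz smoothness of $f$, as shown in (4.3) of [3]. For the inner product term, we have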

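$$
\begin{aligned}
&-\gamma\,\mathbb{E}\left\langle\nabla f(\bar{w}^t), \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}g_j(w_j^t)\right\rangle\\
&\overset{(b)}{=} -\gamma\,\mathbb{E}\left\langle\nabla f(\bar{w}^t), \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\rangle\\
&= \frac{\gamma}{2}\left(\mathbb{E}\left\|\nabla f(\bar{w}^t) - \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\|^2 - \mathbb{E}\|\nabla f(\bar{w}^t)\|^2 - \mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\|^2\right).
\end{aligned} \tag{A.4}
$$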

Recalling that , we note that the expectation operator in (A.1) is over all random samples , where denotes the random samples used for SGD in iteration . In step of (A.1), a conditional expectation is taken for given . We note that due to the unbiased gradient assumption, from which follows. This way of replacing the random gradient by its unbiased average through the conditional expectation is also used later in the proof, where we may not write out the conditional expectation step for compactness.

For the last term of (A.3), we have

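$$
\begin{aligned}
\mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}g_j(w_j^t)\right\|^2 &\overset{(a)}{=} \mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\left(g_j(w_j^t) - \nabla F_j(w_j^t)\right)\right\|^2 + \mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\|^2\\
&\le \frac{1}{N^2}\sum_{i=1}^{N}\frac{1}{n_i}\sigma^2 + \mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\|^2,
\end{aligned} \tag{A.5}
$$

where (a) holds because the stochastic gradients are unbiased, so the cross term vanishes in expectation.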

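Substituting (A.4) and (A.5) into (A.3), we have

$$
\begin{aligned}
\mathbb{E}f(\bar{w}^{t+1}) \le{}& \mathbb{E}f(\bar{w}^t) + \frac{\gamma^2 L}{2}\left(\frac{1}{N^2}\sum_{i=1}^{N}\frac{1}{n_i}\right)\sigma^2 - \frac{\gamma}{2}\mathbb{E}\|\nabla f(\bar{w}^t)\|^2 - \frac{\gamma}{2}(1 - \gamma L)\,\mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\|^2\\
&+ \frac{\gamma}{2}\mathbb{E}\left\|\nabla f(\bar{w}^t) - \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\|^2.
\end{aligned} \tag{A.6}
$$

Suppose $1 - \gamma L \ge 0$, that is, $\gamma \le \frac{1}{L}$. We can obtain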

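$$\mathbb{E}f(\bar{w}^{t+1}) \le \mathbb{E}f(\bar{w}^t) + \frac{\gamma^2 L}{2}\left(\frac{1}{N^2}\sum_{i=1}^{N}\frac{1}{n_i}\right)\sigma^2 - \frac{\gamma}{2}\mathbb{E}\|\nabla f(\bar{w}^t)\|^2 + \frac{\gamma}{2}\mathbb{E}\left\|\nabla f(\bar{w}^t) - \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\|^2. \tag{A.7}$$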

Now we compute the upper bound of the last term of inequality (A.7)

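$$
\begin{aligned}
&\mathbb{E}\left\|\nabla f(\bar{w}^t) - \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\|^2\\
&\le \mathbb{E}\left\|\nabla f(\bar{w}^t) - \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(\bar{w}_i^t) + \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(\bar{w}_i^t) - \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\|^2\\
&\le 2\,\mathbb{E}\left\|\nabla f(\bar{w}^t) - \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(\bar{w}_i^t)\right\|^2 + 2\,\mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(\bar{w}_i^t) - \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j \in V_i}\nabla F_j(w_j^t)\right\|^2\\
&\overset{(a)}{\le} 2L^2\frac{1}{N}\sum_{l=1}^{N}\mathbb{E}\|\bar{w}^t - \bar{w}_l^t\|^2 + 2L^2\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{k \in V_i}\mathbb{E}\|\bar{w}_i^t - w_k^t\|^2,
\end{aligned} \tag{A.8}
$$

where (a) is due to the Lipschitz gradient assumption in Assumption 1(a) of the main paper and (A.1).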

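Substituting (A.8) into (A.7) and rearranging the order, we have

$$\frac{\gamma}{2}\mathbb{E}\|\nabla f(\bar{w}^t)\|^2 \le \mathbb{E}f(\bar{w}^t) - \mathbb{E}f(\bar{w}^{t+1}) + \frac{\gamma^2 L}{2}\left(\frac{1}{N^2}\sum_{i=1}^{N}\frac{1}{n_i}\right)\sigma^2 + \frac{\gamma}{2}\left[2L^2\frac{1}{N}\sum_{l=1}^{N}\mathbb{E}\|\bar{w}^t - \bar{w}_l^t\|^2 + 2L^2\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{k \in V_i}\mathbb{E}\|\bar{w}_i^t - w_k^t\|^2\right]. \tag{A.9}$$

Dividing both sides by $\frac{\gamma}{2}$ and taking the average over time, we have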

 1TT−1∑t=0E∥∇f(¯wt)∥2 ≤2γ[f(¯w0)−Ef(¯wT)]+γL(1N2N∑i=11ni)σ2+2L21TT−1∑t=01NN∑l=1E∥¯wt−¯wtl∥2+<