Local Averaging Helps: Hierarchical Federated Learning and Convergence Analysis
Abstract
Federated learning is an effective approach to realize collaborative learning among edge devices without exchanging raw data. In practice, these devices may connect to local hubs, which are in turn connected to the global server (aggregator). Given the (possibly limited) computation capability of these local hubs, it is reasonable to assume that they can perform simple averaging operations. A natural question is whether such local averaging is beneficial under different system parameters, and how much gain can be obtained compared to the case without such averaging. In this paper, we study hierarchical federated learning with stochastic gradient descent (HFSGD) and conduct a thorough theoretical analysis of its convergence behavior. The analysis characterizes the impact of local averaging precisely as a function of system parameters. Since global averaging incurs a higher communication cost, we propose a strategy of decreasing the global averaging frequency while increasing the local averaging frequency. Experiments validate the theoretical analysis and the advantages of hierarchical federated learning.
Preprint. Work in progress.
1 Introduction
Federated learning (FL) is an emerging distributed machine learning technique that enables learning at a large scale from decentralized data generated by client devices [19, 11, 8, 14, 21, 23]. In particular, it has been proposed for wireless edge and mobile networks, where a massive number of devices collaborate to perform a learning task without exchanging raw data [14, 13, 7, 20]. Compared to conventional machine learning techniques where data need to be transmitted to and stored at a central location, the benefits of FL include improved data privacy and reduced communication and storage requirements.
In a typical FL system, a server in the cloud acts as a parameter aggregator that receives iterative updates from clients distributed over a large geographical area. After receiving the updates, the server aggregates (averages) the parameters and sends the result back to clients for the next iteration. The challenge with this approach is that it involves substantial communication over wide-area networks (WANs), which are prone to unpredictable latency and packet loss, especially in areas with poor network connections. This can cause poorly connected areas to be treated unfairly in the model training process, so that the FL result is biased toward data originating from well-connected clients.
To resolve this problem, we present a hierarchical FL technique in this paper, where two “levels” of parameter aggregation functionalities are distributed across the networked system, as shown in Figure 1. In particular, local averaging is performed within groups of clients (e.g., at local hubs) before the results are aggregated globally at the server.
To support hierarchical FL with local averaging, a fundamental question is: how does local averaging affect FL convergence and what is the impact of data distribution at FL clients on this process? In this paper, we address this question by conducting a thorough theoretical analysis with empirical validations. Our main contributions are summarized as follows.

We present a hierarchical FL method based on stochastic gradient descent (SGD) (referred to as HFSGD) that leverages local averaging within groups of clients.

We theoretically analyze the convergence bound of HFSGD for both fixed client grouping and uniformly random client grouping, with non-independent and non-identically distributed (non-IID) data across the clients and groups. This is the first work to analyze the effect of client grouping by considering both local and global divergences in HFSGD. Our analytical bound clearly demonstrates the advantage of HFSGD over single-level federated SGD (SFSGD).

Our theoretical results lead to useful insights such as how local and global averaging periods (intervals) and data heterogeneity affect the convergence of HFSGD. These are validated with experiments on real datasets. Our experiments show that HFSGD brings the benefit of accelerated learning convergence while reducing the communication cost of global averaging.
2 Related Work
FL was first proposed by McMahan et al. [19] as the federated averaging (FedAvg) algorithm. It was later shown that FedAvg converges with non-IID client data [24, 15]. The non-IID data distribution at clients is one of the major challenges in FL, also known as data heterogeneity.
To improve the communication efficiency of FL, methods for adapting the parameter averaging period (interval) for efficient learning were proposed [23, 22]. Model/gradient sparsification and compression techniques were also applied to FL; these reduce the amount of information exchanged between clients and the server while maintaining a similar model accuracy [11, 9, 10, 6]. In addition, partial client participation can further reduce the computation and communication overhead [2], which, however, still requires the participating clients to communicate with the server directly, and the convergence rate is inversely proportional to the number of participating clients [15]. These techniques are orthogonal to our work and may be applied together with our proposed hierarchical FL.
By moving parameter aggregation from the server down to the clients, decentralized SGD was proposed by Lian et al. [16, 17]. However, it does not address the common FL scenario where an aggregation is only performed once after multiple local iterations for communication efficiency, and does not support a hierarchical setup with aggregation at local hubs either. Some recent works have considered similar settings as hierarchical FL. Chen et al. [5] studied sequential updates from each group, which is different from our HFSGD method that allows parallel updates with two levels of (local and global) aggregation. Independent and identically distributed (IID) data was considered in the work by Castiglia et al. [4], which does not capture the impact of data heterogeneity. The result by Liu et al. [18] has a bound that is exponential in the parameter aggregation period (interval) and can become loose if the period is large.
In this paper, different from the above existing works, we derive the convergence bound of hierarchical FL (more specifically, HFSGD) that captures the impact of data heterogeneity within and across groups of clients, based on which useful insights are observed. In addition, our bound is tight and degenerates to the best known results for SFSGD and non-federated (i.e., centralized) SGD. Our analysis is challenging due to the complex interplay between local aggregation (within a group), global aggregation (across groups), data heterogeneity, and convergence.
3 Hierarchical Federated Stochastic Gradient Descent (HFSGD)
In classical FedAvg [19], a set of clients collaboratively minimizes the overall empirical risk in an iterative manner. After a number of local SGD iterations (referred to as a local period) computed on each client’s local data, the resulting model parameters are sent to the server, which performs an averaging operation to obtain an intermediate global model. The global model parameters are then sent to all clients, and the FL process continues until convergence. As mentioned earlier, we refer to this classical FedAvg method as SFSGD, to differentiate it from our HFSGD technique.
In our hierarchical FL system (see Figure 1), all clients are partitioned into groups . The number of clients in each group is denoted by ().
The grouping can be a natural result of clients connecting to different local hubs, where those clients that are connected to a common hub can be considered as one group. In this case, grouping is fixed based on the network topology.
In another example, grouping can be based on the dynamic configuration of parameter aggregation functions and routing in an NFVsupported system, which is largely configurable at a logical level and allows strategies such as random grouping.
In this paper, our main goal is to characterize the impact of grouping and show its advantages.
For a given grouping,
we can write the objective function as
(1) 
where is the model parameter vector and is the empirical risk function defined over the local dataset at client ( is the loss of data sample ). In this paper, we consider a general that is nonconvex. Then, for each group , we define the intragroup objective function as
(2) 
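For reference, the overall, per-client, and intra-group objectives in FedAvg-style analyses are commonly written in the following weighted-average form; the symbols below (\(x\) for the model, \(F_i\) for client \(i\)'s empirical risk, \(n_i\) for its dataset size) are illustrative choices, since the paper's own notation may differ:

```latex
F(x) \;=\; \sum_{i=1}^{N} \frac{n_i}{n}\, F_i(x),
\qquad
F_i(x) \;=\; \frac{1}{n_i} \sum_{\xi \in \mathcal{D}_i} \ell(x;\xi),
\qquad
F_g(x) \;=\; \sum_{i \in \mathcal{G}_g} \frac{n_i}{n_g}\, F_i(x),
```

where \(n = \sum_i n_i\) is the total number of samples, \(\mathcal{G}_g\) is the set of clients in group \(g\), and \(n_g\) is the number of samples held by group \(g\).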
During each local iteration , each client updates its own model using SGD:
(3) 
where is the learning rate, is the stochastic gradient of , and represents random data samples from the local dataset at client . We assume that the stochastic gradient is an unbiased estimate of the true gradient.
Instead of aggregating all local models at the global aggregator (server) at the end of a local period as in SFSGD, models are first averaged within each group after a fixed number of local iterations. In particular, at each such local averaging iteration, we compute the group-averaged model. This can be done at a local aggregator (e.g., a computational component in close proximity) for the group. After several rounds of intra-group averaging, the models from all groups are averaged globally; a global aggregation is performed after a fixed number of local iterations, which we take to be a common multiple of the groups’ local averaging periods. We also assume that all clients perform synchronous updates. Therefore, FL is conducted both at the local level within a group and at the global level across groups. This design is referred to as HFSGD, which is summarized in Algorithm 1, where the divisibility notation indicates whether the current iteration index is (or is not) an integer multiple of the corresponding averaging period.
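To make the update-and-average schedule concrete, the following is a minimal, self-contained sketch of HFSGD on synthetic quadratic losses. This is not the paper's implementation; the client count, grouping, periods, learning rate, and loss functions are all illustrative assumptions.

```python
import numpy as np

def hfsgd(clients, groups, lr=0.1, tau_l=2, tau_g=8, total_iters=40, dim=5, seed=0):
    """Minimal HFSGD sketch: local SGD steps, intra-group averaging every
    tau_l iterations, global averaging every tau_g iterations (tau_g is a
    multiple of tau_l). Each client i minimizes a toy quadratic
    f_i(x) = 0.5 * ||x - c_i||^2, so its stochastic gradient is
    (x - c_i) plus small noise."""
    rng = np.random.default_rng(seed)
    centers = {i: rng.normal(size=dim) for i in clients}  # per-client optima
    x = {i: np.zeros(dim) for i in clients}               # per-client models
    for t in range(1, total_iters + 1):
        for i in clients:                                  # local SGD step
            grad = (x[i] - centers[i]) + 0.01 * rng.normal(size=dim)
            x[i] = x[i] - lr * grad
        if t % tau_g == 0:                                 # global averaging
            avg = sum(x[i] for i in clients) / len(clients)
            for i in clients:
                x[i] = avg.copy()
        elif t % tau_l == 0:                               # intra-group averaging
            for g in groups:
                g_avg = sum(x[i] for i in g) / len(g)
                for i in g:
                    x[i] = g_avg.copy()
    return x

clients = list(range(4))
groups = [[0, 1], [2, 3]]
models = hfsgd(clients, groups)
# The final iteration here is a global averaging step, so all clients
# end up holding the same model.
```

Note the `elif`: when a global averaging step coincides with a local one, the global average subsumes the local averages, matching the assumption that the global period is a common multiple of the local periods.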
4 Convergence Analysis
We begin with a description of the assumptions made in our convergence analysis. Then, we present our results for HFSGD with fixed client groupings. These will be followed by our results for HFSGD with random grouping. The analysis presented in this work characterizes the convergence behavior of HFSGD and clearly demonstrates its advantage over SFSGD.
4.1 Assumptions
Assumption 1.
In this paper, we make the following assumptions.
a) Lipschitz gradient
(4) 
where is some positive constant.
b) Bounded variance
(5) 
where is the dataset at client .
c) Bounded global divergence (HFSGD)
(6) 
d) Bounded local divergence (HFSGD)
(7) 
Note that the Lipschitz gradient assumption also applies to the group objectives and the overall objective.
For the special case of , all clients form a single group and HFSGD reduces to SFSGD. In this case, we can rewrite (7) as the following.
Assumption 2.
Bounded divergence (SFSGD)
(8) 
To characterize the impact of client grouping in HFSGD, instead of assuming (8), we assume (6) and (7) for bounded global and local divergences, respectively. In particular, the global divergence captures the relationship between the gradients of the intra-group objectives (defined on clients’ data within each group) and the gradient of the overall objective (defined on all clients’ data). A larger global divergence means a more heterogeneous data distribution between groups. Similarly, the local divergence characterizes the data heterogeneity of clients within each group.
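For intuition, both divergence measures can be estimated empirically from per-client gradients. The sketch below assumes uniform client weights and takes the worst case over groups and clients; these are illustrative assumptions and may differ from the exact definitions in (6) and (7).

```python
import numpy as np

def divergences(client_grads, groups):
    """Empirical analogue of the divergence measures: global divergence
    compares each group's average gradient with the overall average
    gradient; local divergence compares each client's gradient with its
    group's average. Uniform client weights are assumed for illustration."""
    all_grads = np.stack([client_grads[i] for i in sorted(client_grads)])
    global_grad = all_grads.mean(axis=0)
    global_div, local_div = 0.0, 0.0
    for g in groups:
        group_grad = np.stack([client_grads[i] for i in g]).mean(axis=0)
        global_div = max(global_div,
                         float(np.sum((group_grad - global_grad) ** 2)))
        for i in g:
            local_div = max(local_div,
                            float(np.sum((client_grads[i] - group_grad) ** 2)))
    return global_div, local_div

# Clients agree within groups but the two groups disagree: the local
# divergence is zero while the global divergence is positive.
grads = {0: np.array([1.0, 0.0]), 1: np.array([1.0, 0.0]),
         2: np.array([-1.0, 0.0]), 3: np.array([-1.0, 0.0])}
g_div, l_div = divergences(grads, [[0, 1], [2, 3]])
```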
4.2 HFSGD with Fixed Client Grouping
In the following, we will present our main convergence results when the client grouping is fixed.
Theorem 1.
All missing proofs in this paper can be found in the appendix.
Note that the bound (9a)–(9c) can be partitioned into three parts. The two terms in (9a) correspond to the impact of non-federated SGD: in the degenerate setting, they match the non-federated SGD result by Bottou et al. [3], showing that our bound reduces to the non-federated SGD case. The term in (9b) shows the impact of global (cross-group) averaging. The term in (9c) mainly corresponds to the impact of local (intra-group) averaging; note that the global averaging period also appears in (9c). This can be interpreted as follows: when the local averaging periods are fixed and the global averaging period increases, the bound in (9c) becomes larger, since a longer global period reduces the global averaging frequency.
In (9b) and (9c), it can be seen that the coefficients in front of the SGD noise and the coefficients in front of the divergences involve different factors of the averaging periods. This means that the divergences may play a more important role in the convergence rate. One implication from this observation is that if the global averaging period is fixed, then it is desirable to choose a smaller local averaging period for a larger divergence, and a larger one for a smaller divergence (in a way that the bound (9a)–(9c) remains unchanged), which matches intuition.
When , we can obtain the following corollary directly from Theorem 1, which gives a slightly looser but simpler upper bound.
Corollary 1.
Remark. From Corollary 1, with an appropriately chosen learning rate and a sufficiently large number of iterations, it can be seen that
(11) 
If all groups have an equal number of clients, then we can obtain
(12) 
which achieves a linear speedup compared to non-federated SGD, aligning with the observation by Yu et al. [24].
In the following corollary, we will show that (10) of Corollary 1 can be reduced to the case of SFSGD. We use to denote the aggregation period of SFSGD.
Corollary 2.
(Degenerate to SFSGD case) Let and , from (10), we obtain
(13) 
While (13) is similar to the bound by Yu et al. [24], we note that the third term in the last equality in (13) has an additional term compared to the bound in Yu et al. [24]. This term can potentially make our bound tighter. Another important observation is that the techniques used to obtain this term are key to making our HFSGD bound (10) degenerate to the SFSGD and non-federated SGD cases.
4.3 HFSGD with Random Grouping and Comparison to SFSGD
Building upon the results of Theorem 1 for a fixed client grouping, we further analyze the convergence performance of HFSGD under the assumption of random grouping. The main technique is to first characterize the local and global divergences of HFSGD under random grouping (see Lemmas 1 and 2 below) and relate them to the divergence of SFSGD defined in (8). An upper bound on the average performance of HFSGD is presented in Theorem 2 below, which demonstrates the advantage of HFSGD over SFSGD under random grouping.
For client grouping, we consider all possible grouping strategies with fixed group sizes. Then, we uniformly select one grouping strategy at random, and let a random variable denote the uniformly random grouping strategy; each realization of this random variable corresponds to one grouping realization. Based on the uniformly random grouping strategy, in the following, we show that under (8) in Assumption 2, we can explicitly compute the average global and local divergences, which can be considered as counterparts of (6) and (7) in Assumption 1, as a function of the numbers of clients and groups.
Considering a random client in a given group, we can compute the required expectations over all possible grouping strategies. This leads to the following two lemmas.
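One realization of the uniformly random grouping strategy can be drawn as follows: shuffle all client indices uniformly at random, then split them into groups of the prescribed sizes. This is an illustrative sketch; the client count and group sizes are assumptions.

```python
import random

def random_grouping(num_clients, group_sizes, seed=None):
    """Draw one realization of the uniformly random grouping strategy:
    shuffle all client indices, then cut the shuffled list into
    consecutive groups of the prescribed sizes. group_sizes must sum
    to num_clients."""
    assert sum(group_sizes) == num_clients
    idx = list(range(num_clients))
    random.Random(seed).shuffle(idx)
    groups, start = [], 0
    for size in group_sizes:
        groups.append(idx[start:start + size])
        start += size
    return groups

groups = random_grouping(10, [5, 5], seed=42)
# Every client appears in exactly one group.
```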
Lemma 1.
Using the uniformly random grouping strategy , the average global divergence is given by
(14) 
where is given in (8).
Lemma 2.
Using the uniformly random grouping strategy , the average local divergence is given by
(15) 
where is given in (8).
Lemmas 1 and 2 show an interesting behavior of grouping. If we add the upper bounds of the average global divergence (14) and the average local divergence (15), the sum equals the divergence bound in (8), independent of the other system parameters. In Theorem 2, we will show that grouping can effectively “break” this divergence into two parts, which are “modulated” by different functions of the grouping parameters.
For simplicity, we let each group have the same local period . Then, we can obtain the following.
Theorem 2.
Using the uniformly random grouping strategy , let , then we have
(16) 
where .
The proof follows by substituting the expected divergences from Lemmas 1 and 2 into the convergence bound of Corollary 1. Without additional assumptions, the bound (16) also degenerates to the SFSGD case by merging all clients into a single group.
Remark. Theorem 2 can be interpreted as follows. First, (16) explicitly shows the impact of grouping on the convergence rate of HFSGD. For example, in the last term in (16), as mentioned earlier, the grouping approach “breaks” the divergence into two parts, whose weights are determined by the choice of the averaging periods and the group sizes. One implication is that a smaller number of groups is preferable when the global averaging period is large. In addition, we note that the performance of HFSGD can be sandwiched between the performances of two SFSGDs. To see this, we consider the following three scenarios: 1) SFSGD with the (shorter) local averaging period, 2) SFSGD with the (longer) global averaging period, and 3) HFSGD with both the local and the global averaging periods. We let all three start from the same initial model and choose the same learning rate. Then all three cases have the same first two terms in (16). However, for the third and fourth terms in (16), we have
(17)  
(18) 
Eqs. (17) and (18) show that the performance of HFSGD is lower and upper bounded by SFSGD with the shorter and the longer averaging periods, respectively. The grouping approach explicitly characterizes how much the convergence bound moves towards the best case, i.e., SFSGD with the shorter period. Note, however, that SFSGD with the shorter period incurs the highest global communication cost, so grouping can adjust the tradeoff between convergence and communication cost.
5 Extension to Reducing Communication Cost
We introduce two approaches to reduce communication cost based on our theoretical results above. Later, we will provide experimental results in Section 6.
Using more local aggregations. In FL settings, global averaging is often more expensive than local averaging, as explained in Section 1. In this case, we can reduce the communication cost by decreasing the frequency of global averaging (i.e., increasing the global averaging period) while increasing the frequency of local averaging (i.e., decreasing the local averaging period). Our theoretical analysis shows that this approach does not necessarily degrade the performance of HFSGD; in some cases, e.g., if the local divergence is sufficiently small, the convergence performance of HFSGD can actually be improved. To see this, consider the last two terms of (16). For a non-trivial client grouping, if we increase the global averaging period and decrease the local averaging period proportionally, then one can show that the bound (16) with the new periods can be lower than or equal to that with the original periods. A similar behavior can also be seen for fixed grouping, as shown by the experimental results in Section 6.
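The communication saving from this trade can be illustrated with a simple cost model counting aggregation rounds. The cost weights and periods below are assumptions for illustration (global averaging taken to be 10x as expensive as local averaging), not values from the paper.

```python
def comm_cost(total_iters, tau_l, tau_g, local_cost=1.0, global_cost=10.0):
    """Simple communication cost model: a local aggregation happens every
    tau_l iterations (except when it coincides with a global one), and a
    global aggregation every tau_g iterations. Cost weights are
    illustrative assumptions."""
    n_global = total_iters // tau_g
    n_local = total_iters // tau_l - n_global  # exclude rounds subsumed by global ones
    return n_local * local_cost + n_global * global_cost

base = comm_cost(1000, tau_l=4, tau_g=8)      # 125 local + 125 global rounds
cheaper = comm_cost(1000, tau_l=2, tau_g=16)  # halve tau_l, double tau_g
# Doubling the global period while halving the local period cuts the
# total cost under this model, since global rounds dominate the cost.
```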
Adjusting the grouping strategy. Recall the divergence-related terms in Corollary 1. Since the coefficient of the global divergence is typically greater than that of the local divergence, decreasing the global divergence is more effective for tightening the convergence upper bound. If knowledge of the data distribution of all clients is available, we can adjust the grouping strategy to achieve a smaller global divergence. For example, we can make the data within each group distributed like a subset sampled uniformly from the union of all clients’ datasets, which is referred to as “group-IID”. In this way, we can increase the global averaging period, and hence reduce the global communication cost, without affecting the convergence performance of HFSGD.
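As a sketch of this idea, the hypothetical helper below deals clients to groups so that the groups' label distributions become more alike (i.e., a smaller global divergence). It assumes each client holds a single dominant label and is only one possible heuristic, not the paper's grouping procedure.

```python
def group_iid_partition(client_labels, num_groups):
    """Form groups whose combined label distributions look alike:
    sort clients by their (single) label, then deal them to groups
    round-robin, so clients sharing a label are spread across groups.
    One illustrative heuristic for the 'group-IID' strategy."""
    order = sorted(client_labels, key=client_labels.get)
    groups = [[] for _ in range(num_groups)]
    for pos, client in enumerate(order):
        groups[pos % num_groups].append(client)
    return groups

# Hypothetical example: 10 clients, two clients per label 0..4.
labels = {i: i % 5 for i in range(10)}
groups = group_iid_partition(labels, 2)
# Each of the two groups ends up covering all five labels.
```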
6 Experiments
In this section, we validate our theoretical results with experiments comparing the performance of HFSGD with that of SFSGD. We also conduct experiments using the two approaches in Section 5 to reduce the communication cost. The implementation is based on PyTorch.
6.1 Setup
Dataset and models. In our experiments, we train VGG-11 for image classification on the CIFAR-10 dataset, which has 10 image classes [12]; we refer to the class labels by their indices for simplicity. In all experiments, we fix the learning rate and the SGD mini-batch size.
Data separation. We implement four cases of data distribution. In the IID case, data are uniformly partitioned across the clients. In the non-IID case, there are 10 clients and the data on each client are from only one image class; for example, the i-th client only has data from the i-th class. Then, we also consider the data distribution across groups, where each group has 5 clients and the data among clients within each group are non-IID distributed. In the “group-IID” case, the data in each group cover all 10 labels, while in the “group-non-IID” case, the clients in the first group hold one half of the labels and the clients in the second group hold the other half.
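A minimal sketch of the IID and per-class non-IID partitions described above (illustrative, not the paper's exact code; the label array is a toy stand-in for CIFAR-10 labels):

```python
import numpy as np

def partition_labels(labels, num_clients, iid=True, seed=0):
    """Split dataset indices across clients. In the IID case indices are
    shuffled and split evenly; in the non-IID case client i receives only
    the samples of class i (assumes num_clients equals the number of
    classes, as in the 10-client CIFAR-10 setup)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(labels))
    if iid:
        rng.shuffle(idx)
        return np.array_split(idx, num_clients)
    lab = np.asarray(labels)
    return [idx[lab == c] for c in range(num_clients)]

toy_labels = [i % 10 for i in range(100)]  # toy stand-in for CIFAR-10 labels
shards = partition_labels(toy_labels, 10, iid=False)
# Each of the 10 shards holds exactly the samples of one class.
```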
6.2 Results
In all plots, curves are labeled by their averaging periods: a single period indicates SFSGD, and a pair of local and global periods indicates HFSGD.
Figure 1(a) shows the results of the IID case, where we separate the clients into two groups of 5 clients each. It can be seen that grouping has limited impact on the convergence when data on clients are IID. This is consistent with our theoretical results in Theorem 2, since IID data imply a very small divergence, so the gain from grouping is small. Nonetheless, the performances of the different schemes align with Theorem 2. To illustrate this, we list the number of local iterations when a target accuracy is first achieved in Table 1. We observe that the performance of HFSGD with a given global period and local period always lies between those of SFSGD with the corresponding shorter and longer periods. Note that in this case HFSGD performs only slightly better than SFSGD. This is because we only have two groups and 10 clients in total; in Theorem 2, the coefficient in front of the divergence is small in this case, which may reduce its impact.
Table 1: Number of local iterations when a target accuracy is first achieved, for the cases compared in Figure 1(a).
Iterations: 10,700; 11,200; 11,300; 11,400; 11,700; 12,200; 12,500
Figure 1(b) shows the results of the non-IID scenario. For HFSGD with two groups, we partition the 10 clients so that Group 1 only contains data with one half of the labels while Group 2 only contains data with the other half. In this case, the benefit of grouping clients is significant. Comparing HFSGD and SFSGD, we see that by performing 4 local aggregations before each global aggregation, HFSGD achieves much better convergence performance than SFSGD. This is because the divergence is relatively large in the non-IID case; according to Theorem 2, by slightly decreasing the local averaging period, the convergence bound can be significantly reduced. In addition, the performance of HFSGD with a given global period and local period still lies between those of SFSGD with the corresponding shorter and longer periods. Moreover, in Figure 1(b), we plot a 5-group case where we partition the same set of clients into 5 groups of 2 clients each. We see that the performance in this case is worse than the case with only two groups: a larger number of groups increases the coefficient of the divergence term in the bound.
In Figure 3, we show the effects of grouping with the group-IID strategy. In the group-IID case, data in each group are from all 10 labels. In the group-non-IID case, data in each group are only from half of the labels (see Section 6.1). From (10) in Corollary 1, it can be seen that the bound depends on the global divergence. Hence, reducing the global divergence can significantly improve the convergence performance, since the global averaging period is desired to be large in general. Figure 3 also shows that the group-IID case performs as well as the group-non-IID case after reducing the global averaging frequency by half.
In Figure 4, we show that decreasing the local averaging period while increasing the global averaging period can maintain the performance while reducing the communication cost. We see that HFSGD with a longer global period and a shorter local period performs similarly to SFSGD. That is, by allowing more local aggregations, HFSGD can reduce the number of global communication rounds while maintaining a performance similar to SFSGD.
7 Conclusion
This paper has studied HFSGD with two-level parameter aggregation. Based on this framework, we have answered important questions on how local averaging affects FL convergence and what the impact of the data distribution at clients is on the learning process. In particular, we have provided a thorough theoretical analysis of the convergence of HFSGD over non-IID data, under both fixed and random client grouping. Explicit comparisons of the convergence rate have been established between HFSGD and SFSGD, based on a novel analysis of the local and global divergences. Our theoretical analysis provides valuable insights into the design of practical HFSGD systems, including the choice of global and local averaging periods. Different grouping strategies are considered to best utilize the divergence measures to reduce communication costs while accelerating learning convergence. Future work could analyze the effect of partial client participation in hierarchical FL and extend the analysis to more than two levels of parameter aggregation.
Appendix
Appendix A Proofs for the Fixed Grouping Case
We present the proofs of Theorem 1 and Corollary 1 in this section. Throughout the proof, we use the following inequalities frequently:
(A.1) 
(A.2) 
where (A.1) is due to Jensen’s inequality.
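As a quick numerical illustration of a Jensen-type inequality commonly used in such proofs (the exact form of (A.1) is assumed here), the following checks that the squared norm of a sum of n vectors is at most n times the sum of their squared norms:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 8, 5
vectors = rng.normal(size=(n, dim))

# Jensen-type inequality: ||sum_i a_i||^2 <= n * sum_i ||a_i||^2,
# a standard tool in convergence proofs involving averaged gradients.
lhs = float(np.sum(vectors.sum(axis=0) ** 2))
rhs = float(n * np.sum(vectors ** 2))
assert lhs <= rhs
```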
For ease of notation, we introduce the following definition.
A.1 Proof of Theorem 1
Although the averaged global model is not observable in the system at every iteration, we use it here for analysis.
(A.3) 
where the inequality follows from a property of Lipschitz smoothness, shown in (4.3) of [3]. For the inner product term,
(A.4) 
We note that the expectation operator in (A.4) is taken over all random samples used for SGD up to the current iteration, and that a conditional expectation is taken given the past randomness. Due to the unbiased gradient assumption, the stochastic gradient equals the true gradient in conditional expectation. This way of replacing the random gradient by its unbiased average through a conditional expectation is also used later in the proof, where we may not write out the conditional expectation step for compactness.