
Abstract

We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the existing theory and provide values of the optimal stepsize and optimal number of local iterations. Our bounds are based on a new notion of variance that is specific to local SGD methods with different data. The tightness of our results is guaranteed by recovering known statements when we plug in $H = 1$, where $H$ is the number of local steps. The empirical evidence further validates the severe impact of data heterogeneity on the performance of local SGD.


1 Introduction

Modern hardware increasingly relies on the power of uniting many parallel units into one system. This approach requires optimization methods that target specific issues arising in distributed environments, such as decentralized data storage. Not having the data in one place implies that computing nodes have to communicate back and forth to keep moving toward the solution of the overall problem. A number of efficient first-order, second-order and dual methods capable of reducing the communication overhead have existed in the literature for a long time, some of which are optimal in a certain sense.

Yet, when Federated Learning (FL) showed up, it turned out that the problem of balancing communication and computation had not been solved. On the one hand, Minibatch Stochastic Gradient Descent (SGD), which averages the results of stochastic gradient steps computed in parallel on several machines, again demonstrated its computational efficiency. Seeking communication efficiency, Konečný et al. [2016], McMahan et al. [2017] proposed to use a natural variant of Minibatch SGD—Local SGD (Algorithm 1), which performs a few SGD iterations locally on each involved node and only then computes the average. This approach saves a lot of time on communication, but, unfortunately, the theory has not kept up with the practice, and there are still gaps in our understanding of Local SGD.

The idea of local SGD is in fact not new: it traces back to the work of Mangasarian [1995] and has since been popular among practitioners from different communities. An asymptotic analysis can be found in Mangasarian [1995], and quite a few recent papers have proved new convergence results, making the bounds tighter with every work. The theory has been developing in two important regimes: identical and heterogeneous data.

The identical data regime is more of interest if the data are actually stored in one place. In that case, we can access it on each computing device at no extra cost and get a fast, scalable method. Although not very general, this framework is already of interest to a wide audience due to its efficiency in training large-scale machine learning models [Lin et al., 2020]. The first contribution of our work is to provide the fastest known rate of convergence for this regime under weaker assumptions than in prior work.

Federated learning, however, is done on a very large number of mobile devices, and is operating in a highly non-i.i.d. regime. To address this, we present the first analysis of Local SGD that applies to arbitrarily heterogeneous data, while all previous works assumed a certain type of similarity between the data or local gradients.

0:  Stepsize $\gamma > 0$, initial vector $x_0^m = x_0$ for all $m \in [M]$, synchronization timesteps $0 = t_0 \le t_1 \le t_2 \le \dots$
1:  for $t = 0, 1, 2, \dots$ do
2:     for $m = 1, \dots, M$ in parallel do
3:        Sample $\xi_t^m$.
4:        if data is identical then
5:           Compute $g_t^m$ such that $\mathbb{E}\left[g_t^m \mid x_t^m\right] = \nabla f(x_t^m)$.
6:        else
7:           Compute $g_t^m$ such that $\mathbb{E}\left[g_t^m \mid x_t^m\right] = \nabla f_m(x_t^m)$.
8:        end if
9:        $x_{t+1}^m = \begin{cases} \frac{1}{M}\sum_{j=1}^{M} \left(x_t^j - \gamma g_t^j\right) & \text{if } t + 1 = t_p \text{ for some } p, \\ x_t^m - \gamma g_t^m & \text{otherwise.} \end{cases}$
10:     end for
11:  end for
Algorithm 1 Local SGD
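To make the update rule concrete, here is a minimal single-process Python sketch that simulates Algorithm 1; the quadratic local losses, the Gaussian gradient noise, and all parameter values are illustrative assumptions rather than part of the method.

```python
import numpy as np

def local_sgd(grads, x0, gamma, num_steps, sync_every):
    """Simulate Local SGD (Algorithm 1) for M workers in a single process.

    grads[m](x, rng) returns a stochastic gradient of the m-th local loss at x.
    Workers take local steps and average their iterates every `sync_every` steps.
    """
    M = len(grads)
    rng = np.random.default_rng(0)
    x = [x0.copy() for _ in range(M)]            # x_t^m for each worker m
    for t in range(num_steps):
        for m in range(M):
            g = grads[m](x[m], rng)              # unbiased estimate of the local gradient
            x[m] = x[m] - gamma * g              # local SGD step
        if (t + 1) % sync_every == 0:            # synchronization timestep
            avg = sum(x) / M
            x = [avg.copy() for _ in range(M)]   # all workers continue from the average
    return sum(x) / M

# Toy heterogeneous example: f_m(x) = 0.5 * ||x - b_m||^2 with noisy gradients.
if __name__ == "__main__":
    b = [np.array([1.0, 0.0]), np.array([-1.0, 2.0]), np.array([0.0, -2.0])]
    grads = [lambda x, rng, bm=bm: (x - bm) + 0.1 * rng.standard_normal(2) for bm in b]
    x_out = local_sgd(grads, x0=np.zeros(2), gamma=0.1, num_steps=300, sync_every=10)
    print(x_out)  # should approach the mean of the b_m, the minimizer of the average loss
```

Setting `sync_every=1` recovers fully synchronized Minibatch SGD, while `sync_every=num_steps` corresponds to one-shot averaging.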

To explain the challenge of heterogeneity better, let us introduce the problem we are trying to solve. Given that there are $M$ devices and corresponding local losses $f_1, \dots, f_M$, we want to find

$\min_{x \in \mathbb{R}^d} \left[ f(x) := \frac{1}{M} \sum_{m=1}^{M} f_m(x) \right].$ (1)

In the case of identical data, we are able to obtain on each node an unbiased estimate of the gradient $\nabla f(x)$. In the case of heterogeneous data, the $m$-th node can only obtain an unbiased estimate of its local gradient $\nabla f_m(x)$. Data similarity can then be formulated in terms of the differences between the functions $f_1, \dots, f_M$. If the underlying data giving rise to the loss functions are i.i.d., the functions share optima, and one could even minimize them separately, averaging the results at the end. We will demonstrate this rigorously later in the paper.

If the data are dissimilar, however, we need to be much more careful since running SGD locally will yield solutions of local problems. Clearly, their average might not minimize the true objective (1), and this poses significant issues for the convergence of Local SGD.

To properly discuss the efficiency of local SGD, we also need a practical way of quantifying it. Normally, a method’s efficiency is measured by the total number of times each function is accessed and the cost of these accesses. In distributed learning, however, we also care about how much information each computing node needs to communicate. In fact, when communication is as expensive as it is in FL, we predominantly care about communication. The question we address in this paper can thus be posed as follows: how many times does each node need to communicate if we want to solve (1) up to accuracy $\varepsilon$? Equivalently, we can ask for the optimal synchronization interval $H$ between communications, i.e., how many computation steps per communication we can allow. We next review related work and then present our contributions.
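For concreteness, the bookkeeping behind this question is simple (a worked restatement of the quantities above, not a new result): if the method runs for $T$ iterations in total and synchronizes every $H$ steps, the number of communication rounds is

$R = \left\lceil \frac{T}{H} \right\rceil,$

so for a fixed iteration budget $T(\varepsilon)$ needed to reach accuracy $\varepsilon$, increasing $H$ reduces communication proportionally; the question is how large $H$ can be before the drift between local iterates starts to inflate $T(\varepsilon)$ itself.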

Table 1 compares the following works: Stich (5/2018), Basu et al. (6/2018), Stich and Karimireddy (9/2019), Haddadpour et al. (10/2019), and this work (9/2019–1/2020), along these axes: whether unbounded gradients are allowed, whether the guarantee remains convergent, and the bounds for strongly convex and for convex objectives.

  • The table entries denote the minimum number of communication steps required over $T$ iterations to achieve a linear speedup in the number of nodes $M$.

  • The PL inequality, a generalization of strong convexity, is assumed in [Haddadpour et al., 2019], but for comparison we specialize to strong convexity.

Table 1: Existing theoretical bounds for local SGD for identical data with convex objectives.

2 Related Work

While local SGD has been used by practitioners for a long time, see e.g. [Coppola, 2015, McDonald et al., 2010], its theoretical analysis has been limited until recently. Early theoretical work on the convergence of local methods exists, e.g. [Mangasarian, 1995], but no convergence rate was given there. The previous work can mainly be divided into two groups: those assuming identical data (i.e., that all nodes have access to the same dataset) and those that allow each node to hold its own dataset. As might be expected, the analysis in the latter case is more challenging, more limited, and usually shows worse rates. We note that in recent work more sophisticated local stochastic gradient methods have been considered, for example with momentum [Yu et al., 2019a, Wang et al., 2019], with quantization [Reisizadeh et al., 2019, Basu et al., 2019], with adaptive stepsizes [Xie et al., 2019] and with various variance-reduction methods [Liang et al., 2019, Sharma et al., 2019, Karimireddy et al., 2019]. Our work is complementary to these approaches, and provides improved rates and analysis for the vanilla method.

2.1 Local SGD with Identical Data

The analysis of local SGD in this setting shows that a reduction in communication is possible without affecting the asymptotic convergence rate of Minibatch SGD with $M$ nodes (albeit usually with a worse dependence on constants). An overview of related work on local SGD for convex objectives is given in Table 1. We note that analysis for nonconvex objectives has been carried out in a few recent works [Zhou and Cong, 2018, Wang and Joshi, 2018, Jiang and Agrawal, 2018], but our focus in this work is on convex objectives and hence they are not included in Table 1. The comparison shows that we attain superior rates in the strongly convex setting to previous work, with the exception of the concurrent work of Stich and Karimireddy [2019], and we attain these rates under less restrictive assumptions on the optimization process compared to them. We further provide a novel analysis in the convex case, which had not been previously explored in the literature, with the exception of [Stich and Karimireddy, 2019]. Their analysis attains the same communication complexity but is much more pessimistic about possible values of $H$. In particular, it does not recover the convergence of one-shot averaging: substituting $H = T$, or even smaller values, gives noninformative bounds, unlike our Theorem 1.

In addition to the works listed in the table, Dieuleveut and Patel [2019] also analyze local SGD for identical data under a Hessian smoothness assumption in addition to gradient smoothness, strong convexity, and uniformly bounded variance. However, we believe that there are issues in their proof that we explain in Section 11 in the supplementary material. As a result, the work is excluded from the table.

2.2 Local SGD with Heterogeneous Data

Table 2 compares the following works: Yu et al. (7/2018), Jiang and Agrawal (12/2018), Basu et al. (6/2019), Haddadpour and Mahdavi (10/2019), and this work (1/2020), along these axes: whether unbounded gradients are allowed, whether unbounded dissimilarity/diversity is allowed, and whether strongly convex, convex, or nonconvex objectives are covered.
Table 2: Existing theoretical bounds for local SGD with heterogeneous data.

An overview of related work on local SGD in this setting is given in Table 2. In addition to the works in Table 2, Wang et al. [2018] analyze a local gradient descent method under convexity, bounded dissimilarity, and bounded gradients, but do not show convergence to arbitrary precision. Li et al. [2020] analyze federated averaging (discussed below) in the strongly convex and nonconvex cases under bounded gradient norms. However, their result is not included in Table 2 because in the more general setting of federated averaging, their analysis and experiments suggest that retaining a linear speedup is not possible.

Local SGD is at the core of the Federated Averaging algorithm, which is popular in federated learning applications [Konečný et al., 2016]. Essentially, Federated Averaging is a variant of Local SGD with participating devices sampled randomly. This algorithm has been used in several machine learning applications such as mobile keyboard prediction [Hard et al., 2018], and strategies for improving its communication efficiency were explored in [Konečný et al., 2016]. Despite its empirical success, little is known about the convergence properties of this method, and it has been observed to diverge when too many local steps are performed [McMahan et al., 2017]. This is not so surprising as the majority of common assumptions are not satisfied; in particular, the data are typically very non-i.i.d. [McMahan et al., 2017], so the local gradients can point in different directions. This property of the data can be written, for any vector $x$ and indices $i, j$, as

$\nabla f_i(x) \neq \nabla f_j(x).$

Unfortunately, it is very hard to analyze local methods without assuming a bound on the dissimilarity of $\nabla f_i(x)$ and $\nabla f_j(x)$. For this reason, almost all prior work assumed some regularity notion over the functions such as bounded dissimilarity [Yu et al., 2019a, Li et al., 2020, Yu et al., 2019b, Wang et al., 2018] or bounded gradient diversity [Haddadpour and Mahdavi, 2019] and addressed other, less challenging aspects of federated learning such as decentralized communication, nonconvexity of the objective or unbalanced data partitioning. In fact, a common way to make the analysis simple is to assume Lipschitzness of the local functions, i.e., $\|\nabla f_m(x)\| \le G$ for any $x$ and $m$. We argue that this assumption is pathological and should be avoided when seeking a meaningful convergence bound. First of all, in unconstrained strongly convex minimization this assumption cannot be satisfied, making the analysis in works like [Stich, 2019] questionable. Second, there exists at least one method whose convergence is guaranteed under bounded variance [Juditsky et al., 2011], but which in practice diverges [Chavdarova et al., 2019, Mishchenko et al., 2019]. Finally, under the bounded gradients assumption we have

$\|\nabla f_i(x) - \nabla f_j(x)\| \le \|\nabla f_i(x)\| + \|\nabla f_j(x)\| \le 2G.$

In other words, we lose control over the difference between the functions. Since $G$ bounds not just the dissimilarity but also the gradients themselves, it makes the statements less insightful or even vacuous. For instance, such a bound will not be tight if the data are actually i.i.d., since in that case $G$ remains a positive constant. In contrast, we will show that the rate should depend on a much more meaningful quantity,

$\sigma_{\mathrm{dif}}^2 := \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\xi_m} \left\| \nabla f_{\xi_m}(x^*) \right\|^2,$

where $x^*$ is a fixed minimizer of $f$ and $f_{\xi_m}$ for $m = 1, \dots, M$ are stochastic realizations of the local losses $f_m$ (see the next section for the precise setting). Obviously, for all nondegenerate sampling distributions the quantity $\sigma_{\mathrm{dif}}^2$ is finite and serves as a natural measure of variance in local methods. We note that an attempt to obtain a more general convergence statement was made by [Li et al., 2018], but unfortunately their guarantee is strictly worse than that of Minibatch SGD. In the overparameterized regime, where $\sigma_{\mathrm{dif}}^2 = 0$, Zhang and Li [2019] prove the convergence of Local SGD with an arbitrary number of local steps.

Our earlier workshop paper [Khaled et al., 2019b] explicitly analyzed Local Gradient Descent (Local GD) as opposed to Local SGD, where there is no stochasticity in the gradients. An analysis of Local GD for non-convex objectives with the PL inequality and under bounded gradient diversity was subsequently carried out by Haddadpour and Mahdavi [2019].

3 Settings and Contributions

Assumption 1.

Assume that the set of minimizers of (1) is nonempty. Each $f_m$ is $\mu$-strongly convex for $\mu \ge 0$ and $L$-smooth. That is, for all $x, y \in \mathbb{R}^d$,

$f_m(y) + \langle \nabla f_m(y), x - y \rangle + \frac{\mu}{2} \| x - y \|^2 \;\le\; f_m(x) \;\le\; f_m(y) + \langle \nabla f_m(y), x - y \rangle + \frac{L}{2} \| x - y \|^2.$

When $\mu = 0$, we say that each $f_m$ is just convex. When $\mu > 0$, we define $\kappa := L / \mu$, the condition number.
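As a quick worked example (a standard calculation, included here only for illustration): the $\ell_2$-regularized logistic regression objectives used in the experiments of Section 5,

$f_m(w) = \frac{1}{n_m} \sum_{i=1}^{n_m} \log\left(1 + e^{-y_i a_i^\top w}\right) + \frac{\lambda}{2} \| w \|^2,$

satisfy Assumption 1 with $\mu = \lambda$ and $L \le \lambda + \frac{1}{4 n_m} \lambda_{\max}(A_m^\top A_m)$, where $A_m$ is the local data matrix, so $\kappa \le 1 + \lambda_{\max}(A_m^\top A_m) / (4 n_m \lambda)$.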

Assumption 1 formulates our requirements on the overall objective. Next, we have two different sets of assumptions on the stochastic gradients that model different scenarios, which also lead to different convergence rates.

Assumption 2.

Given a function $f$, a point $x \in \mathbb{R}^d$, and a sample $\xi$ drawn i.i.d. according to a distribution $\mathcal{D}$, the stochastic gradient $g = g(x, \xi)$ satisfies

$\mathbb{E}_{\xi}\left[ g \right] = \nabla f(x), \qquad \mathbb{E}_{\xi} \left\| g - \nabla f(x) \right\|^2 \le \sigma^2.$

Assumption 2 holds, for example, when $g = \nabla f(x) + u$ for a zero-mean random variable $u$ with bounded expected squared norm, $\mathbb{E}\|u\|^2 \le \sigma^2$. Assumption 2, however, typically does not hold for finite-sum problems where $g$ is the gradient of one of the functions in the finite sum. To capture this setting, we consider the following assumption:

Assumption 3.

Given an $L$-smooth and $\mu$-strongly convex (possibly with $\mu = 0$) function $f$ written as an expectation $f(x) = \mathbb{E}_{\xi \sim \mathcal{D}}\left[ f_\xi(x) \right]$, we assume that a stochastic gradient is computed as $g = \nabla f_\xi(x)$ with $\xi \sim \mathcal{D}$. We assume that $f_\xi$ is almost surely $L$-smooth and $\mu$-strongly convex (with the same $L$ and $\mu$ as $f$).

When Assumption 3 is used in the identical data setting, we assume it is satisfied on each node with the function $f$ and a common distribution $\mathcal{D}$, and we define as a measure of variance at the optimum

$\sigma_{\mathrm{opt}}^2 := \mathbb{E}_{\xi} \left\| \nabla f_{\xi}(x^*) \right\|^2.$

In the heterogeneous data setting, we assume that it is satisfied on each node $m$ with the function $f_m$ and a distribution $\mathcal{D}_m$, and we analogously define

$\sigma_{\mathrm{dif}}^2 := \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\xi_m} \left\| \nabla f_{\xi_m}(x^*) \right\|^2.$

Assumption 3 holds, for example, for finite-sum optimization problems with uniform sampling, and permits direct extensions to more general settings such as expected smoothness [Gower et al., 2019].
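As a concrete instance (spelled out here only for illustration), a finite-sum problem with uniform sampling fits Assumption 3 with

$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad \xi \sim \mathrm{Uniform}\{1, \dots, n\}, \qquad g = \nabla f_\xi(x),$

in which case the assumption requires each $f_i$ to be $L$-smooth and $\mu$-strongly convex (with $\mu = 0$ allowed).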

Our contributions are as follows:

  1. In the identical data setting under Assumptions 1 and 2 with $\mu > 0$, we prove that the iteration complexity of Local SGD to achieve $\varepsilon$-accuracy is

    in squared distance from the optimum, provided that $H$ is not too large. This improves the communication complexity of prior work (see Table 1) and gives tighter results compared to concurrent work (in particular, recovering the convergence of one-shot averaging). When $\mu = 0$, we have that the iteration complexity of Minibatch SGD to attain an $\varepsilon$-accurate solution in functional suboptimality is

    provided that $H$ is not too large. We further show that the same $H$-dependence holds in both the $\mu > 0$ and $\mu = 0$ cases under Assumption 3. This has not been explored in the literature on Local SGD before, and hence we obtain the first results that apply to arbitrary convex and smooth finite-sum problems.

  2. When the data on each node is different and Assumptions 1 and 3 hold with $\mu = 0$, the iteration complexity needed by Local SGD to achieve an $\varepsilon$-accurate solution in functional suboptimality is

    provided that $H$ is not too large. This improves upon previous work by not requiring any restrictive assumptions on the gradients, and is the first analysis to capture true data heterogeneity between different nodes.

  3. We verify our results by experimenting with logistic regression on multiple datasets, and investigate the effect of heterogeneity on the convergence speed.

4 Convergence Theory

The following quantity is crucial to the analysis of both variants of local SGD, and measures the deviation of the iterates from their average over an epoch:

$V_t := \frac{1}{M} \sum_{m=1}^{M} \mathbb{E} \left\| x_t^m - \bar{x}_t \right\|^2, \qquad \text{where } \bar{x}_t := \frac{1}{M} \sum_{m=1}^{M} x_t^m.$

To prove our results, we follow the line of work started by Stich [2019]: we first show a recurrence similar to that of SGD up to an error term proportional to $V_t$, and then bound each $V_t$ individually, or the sum of the $V_t$'s over an epoch. All proofs are relegated to the supplementary material.

4.1 Identical Data

Our first lemma presents a bound on the sequence of the $V_t$ in terms of the synchronization interval $H$.

Lemma 1.

Choose a stepsize $\gamma$ such that . Under Assumptions 1 and 2, we have for Algorithm 1 with and with identical data, for all

Combining Lemma 1 with perturbed iterate analysis as in [Stich, 2019] we can recover the convergence of local SGD for strongly-convex functions:

Theorem 1.

Suppose that Assumptions 1 and 2 hold with $\mu > 0$. Then for Algorithm 1 run with identical data, a constant stepsize $\gamma$ such that , and such that ,

(2)

By (2) we see that the convergence of local SGD is the same as that of Minibatch SGD plus an additive error term, which can be controlled by controlling the size of $H$, as the next corollary and the subsequent discussion show.

Corollary 1.

Choosing , with for and we take steps. Then substituting in (2) and using that and some algebraic manipulation we can conclude that,

where , and the $\tilde{\mathcal{O}}(\cdot)$ notation ignores polylogarithmic and constant numerical factors.

Recovering fully synchronized Minibatch SGD. When $H = 1$, the error term vanishes and we directly obtain the ordinary rate of Minibatch SGD.

Linear speedup in the number of nodes $M$. We see that choosing $H$ appropriately leads to an asymptotic convergence rate of , which shows the same linear speedup as Minibatch SGD but with a worse dependence on . The number of communications in this case is then .

Local SGD vs Minibatch SGD. We assume that the statistical term dominates the dependence on the initial distance $\|x_0 - x^*\|$. From Corollary 1, we see that in order to achieve the same convergence guarantees as Minibatch SGD, we must have , achieving a communication complexity of . This is only possible when . It follows that given a number of steps $T$, the optimal $H$ is , achieving a communication complexity of .

One-shot averaging. Putting $H = T$ yields a convergence rate of , showing no linear speedup but still showing convergence, which improves upon all previous work. However, we admit that simply using Jensen’s inequality to bound the distance of the average iterate would yield a better asymptotic convergence rate of . Under a Lipschitz Hessian assumption, Zhang et al. [2013] show that one-shot averaging can attain a linear speedup in the number of nodes, so one may analyze local SGD under this additional assumption to try to close this gap, but this is beyond the scope of our work.

Similar results can be obtained for weakly convex functions, as the next Theorem shows.

Theorem 2.

Suppose that Assumptions 1 and 2 hold with $\mu = 0$, that a constant stepsize $\gamma$ such that and is chosen, and that Algorithm 1 is run with identical data and with such that . Then for ,

(3)

Theorem 2 essentially tells the same story as Theorem 1: the convergence of local SGD is the same as that of Minibatch SGD up to an additive constant whose size can be controlled by controlling $H$.

Corollary 2.

Assume that . Choosing , then substituting in (3) we have,

Linear speedup and optimal $H$. From Corollary 2 we see that if we choose appropriately, then we obtain a linear speedup; the number of communication steps is then , and the optimal $H$ is then .

The previous results were obtained under Assumption 2. Unfortunately, this assumption does not easily capture the finite-sum minimization scenario where $f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$ and each stochastic gradient is sampled uniformly at random from the sum.

Using smaller stepsizes and more involved proof techniques, we can show that our results still hold in the finite-sum setting. For strongly-convex functions, the next theorem shows that the same convergence guarantee as Theorem 1 can be attained.

Figure 1: The effect of the dataset and number of workers on the variance parameters. Left: ‘a8a’, middle: ‘mushrooms’, right: ‘w8a’ dataset. We use uniform sampling of data points, so for the smallest minibatch size $\sigma_{\mathrm{dif}}^2$ is the same as $\sigma_{\mathrm{opt}}^2$, while for larger minibatch sizes the value of $\sigma_{\mathrm{dif}}^2$ might be drastically larger than $\sigma_{\mathrm{opt}}^2$.
Theorem 3.

Suppose that Assumptions 1 and 3 hold with $\mu > 0$. Suppose that Algorithm 1 is run with identical data, with for some , and with a stepsize $\gamma$ chosen such that . Then for any timestep at which synchronization occurs,

(4)

As a corollary, we can obtain an asymptotic convergence rate by choosing the stepsize and other parameters appropriately.

Corollary 3.

Let for some , let and choose . We substitute in (4) and take steps, then for ,

Substituting in Corollary 3 we get an asymptotic convergence rate of . This preserves the rate of minibatch SGD up to problem-independent constants and polylogarithmic factors, but with possibly fewer communication steps.

Theorem 4.

Suppose that Assumptions 1 and 3 hold with $\mu = 0$, that a stepsize $\gamma$ is chosen with , and that Algorithm 1 is run on $M$ nodes with identical data and with . Then for any timestep at which synchronization occurs, we have for that

(5)
Corollary 4.

Let , then for we see that , and plugging it into (5) yields

This is the same result as Corollary 2, and hence we see that choosing (when ) yields a linear speedup in the number of nodes $M$.

4.2 Heterogeneous Data

We next show that for arbitrarily heterogeneous convex objectives, the convergence of Local SGD is the same as that of Minibatch SGD plus an error term that depends on $\sigma_{\mathrm{dif}}^2$.

Theorem 5.

Suppose that Assumptions 1 and 3 hold with $\mu = 0$ for heterogeneous data. Then for Algorithm 1 run with heterogeneous data, with , , and a stepsize $\gamma$ such that , we have

where and .

Dependence on $\sigma_{\mathrm{dif}}^2$. We see that the convergence guarantee given by Theorem 5 depends on $\sigma_{\mathrm{dif}}^2$, which measures the heterogeneity of the data distribution. In typical (non-federated) distributed learning settings where the data is distributed before starting training, this term can vary quite significantly depending on how the data is distributed.

Dependence on $H$. We further note that the dependence on $H$ in Theorem 5 is quadratic rather than linear. This translates to a worse upper bound on the synchronization interval that still allows convergence, as the next corollary shows.

Figure 2: Results on the ‘a9a’ dataset, with stepsize . For any number of local iterations, the method converged to a neighborhood within a small number of communication rounds due to the large stepsizes.
Figure 3: Convergence on heterogeneous data with different numbers of local steps on the ‘a5a’ dataset. 1 local step corresponds to fully synchronized gradient descent. Left: convergence in terms of communication rounds, which shows a clear advantage of local GD when only limited accuracy is required. Middle: wall-clock time might improve only slightly if communication is cheap. Right: how the picture changes with a different communication cost.
Corollary 5.

Choose , then , and hence applying the result of Theorem 5,

Optimal $H$. By Corollary 5 we see that the optimal value of $H$ is , which gives a convergence rate of . Thus, the same convergence rate is attained provided that communication is more frequent than in the identical data regime.

5 Experiments

All experiments described below were run on a logistic regression problem with regularization of order . The datasets were taken from the LIBSVM library [Chang and Lin, 2011]. The code was written in Python using MPI [Dalcin et al., 2011] and run on Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz cores in parallel.
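For reference, a minimal Python version of the kind of objective and gradient used in such experiments might look as follows; the use of scikit-learn's `load_svmlight_file`, the $\ell_2$ regularization coefficient, and the file path are illustrative assumptions, not details from the paper.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file  # reads LIBSVM-format files

def logreg_loss_grad(w, A, y, l2):
    """l2-regularized logistic loss and gradient for labels y in {-1, +1}."""
    z = A @ w
    loss = np.mean(np.log1p(np.exp(-y * z))) + 0.5 * l2 * np.dot(w, w)
    # gradient of the mean logistic loss plus the l2 term
    grad = A.T @ (-y / (1.0 + np.exp(y * z))) / A.shape[0] + l2 * w
    return loss, grad

# Example (hypothetical file path): each node would load or slice its own shard of the data.
# A, y = load_svmlight_file("a9a")
# loss, grad = logreg_loss_grad(np.zeros(A.shape[1]), A, y, l2=1.0 / A.shape[0])
```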

5.1 Variance measures

We provide values of $\sigma_{\mathrm{opt}}^2$ and $\sigma_{\mathrm{dif}}^2$ in Figure 1 for different datasets, minibatch sizes and numbers of nodes. The datasets were split evenly, without any data reshuffling and without overlaps. For any minibatch size, the value of $\sigma_{\mathrm{dif}}^2$ is lower bounded by $\frac{1}{M} \sum_{m=1}^{M} \| \nabla f_m(x^*) \|^2$, which explains the difference between identical and heterogeneous data.
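A minimal sketch of how such variance parameters can be estimated is given below; it assumes dense feature matrices, labels in {-1, +1}, an approximate minimizer `w_star`, and the $\ell_2$-regularized logistic loss, and the helper names and Monte Carlo sampling scheme are illustrative choices rather than the exact procedure used for Figure 1.

```python
import numpy as np

def per_example_grads(w, A, y, l2):
    """Per-example gradients (rows) of the l2-regularized logistic loss at w; A is dense here."""
    z = A @ w
    coeff = -y / (1.0 + np.exp(y * z))        # scalar factor per example
    return coeff[:, None] * A + l2 * w        # shape (n, d)

def sigma_opt_sq(w_star, A, y, l2, tau, num_samples=1000, seed=0):
    """Monte Carlo estimate of E || mean of tau sampled gradients at w_star ||^2 (identical data)."""
    rng = np.random.default_rng(seed)
    G = per_example_grads(w_star, A, y, l2)
    est = 0.0
    for _ in range(num_samples):
        idx = rng.integers(0, A.shape[0], size=tau)
        est += np.linalg.norm(G[idx].mean(axis=0)) ** 2
    return est / num_samples

def sigma_dif_sq(w_star, shards, l2, tau, num_samples=1000, seed=0):
    """Same quantity, but averaged over nodes that each sample only from their own shard."""
    vals = [sigma_opt_sq(w_star, A_m, y_m, l2, tau, num_samples, seed) for (A_m, y_m) in shards]
    return float(np.mean(vals))
```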

5.2 Identical Data

For identical data we used several nodes and the ’a9a’ dataset. We estimated the smoothness constant numerically and ran two experiments, with stepsizes and , and with minibatch size equal to 1. In both cases we observe convergence to a neighborhood, although of a different radius. Since we run the experiments on a single machine, communication is very cheap and there is little gain in the time required for convergence. However, the advantage in terms of required communication rounds is self-evident and can lead to a significant time improvement under slow communication networks. The results are provided here in Figure 2 and in the supplementary material in Figure 5.

5.3 Heterogeneous Data

Figure 4: Convergence of local SGD on heterogeneous data with different number of local steps on the ‘a5a’ dataset.

Since our architecture leads to a very specific trade-off between computation and communication, we provide plots for the cases where the communication time relative to the gradient computation time is higher or lower. To see the impact of $H$, in all experiments we use full gradients and a constant stepsize . The data partitioning is not i.i.d. and is done based on the index in the original dataset. The results are provided in Figure 3 and in the supplementary material in Figure 6. In cases where communication is significantly more expensive than gradient computation, local methods are much faster when only limited accuracy is required.
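A sketch of the communication pattern of such a run with mpi4py is given below; the contiguous index-based split, the full local gradients, and the averaging via `allreduce` mirror the setup described above, while the function names and the assumption that every node can load the full dataset before slicing are illustrative choices.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, M = comm.Get_rank(), comm.Get_size()

def local_shard(A_full, y_full):
    """Non-i.i.d. split: node `rank` takes a contiguous block of the original index range."""
    n = A_full.shape[0]
    lo, hi = rank * n // M, (rank + 1) * n // M
    return A_full[lo:hi], y_full[lo:hi]

def run_local_gd(grad_fn, w0, gamma, num_rounds, local_steps):
    """Local GD: `local_steps` full-gradient steps per node between synchronizations.

    grad_fn(w) returns the full gradient of this node's local loss at w.
    """
    w = w0.copy()
    for _ in range(num_rounds):
        for _ in range(local_steps):               # local steps, no communication
            w = w - gamma * grad_fn(w)
        w = comm.allreduce(w, op=MPI.SUM) / M      # synchronization: average the iterates
    return w
```

With `local_steps=1` this reduces to fully synchronized gradient descent, matching the baseline in Figure 3.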

Supplementary Material

6 Basic Facts and Notation

We use a notation similar to that of Stich [2019] and denote the sequence of time stamps when synchronization happens as $t_1, t_2, \dots$. Given the stochastic gradients $g_t^1, \dots, g_t^M$ at time $t$, we define

$g_t := \frac{1}{M} \sum_{m=1}^{M} g_t^m, \qquad \bar{g}_t := \mathbb{E}\left[ g_t \right].$

We define an epoch to be a sequence of timesteps between two synchronizations: for $p \ge 1$, the $p$-th epoch is the sequence of timesteps following $t_{p-1}$ up to and including $t_p$. We summarize some of the notation used in Table 3.

Symbol: Description
$g_t^m$: Stochastic gradient at time $t$ on node $m$. See Algorithm 1.
$x_t^m$: Local iterate at time $t$ on node $m$. See Algorithm 1.
$g_t$: Average of the stochastic gradients across nodes at time $t$. See Algorithm 1.
$\bar{g}_t$: Expected value of $g_t$: $\bar{g}_t = \mathbb{E}[g_t]$.
$\bar{x}_t$: The average of all local iterates at time $t$.
$r_t$: The deviation of the average iterate from the optimum at time $t$.
$\sigma^2$: Uniform bound on the variance of the stochastic gradients for identical data. See Assumption 2.
$\sigma_{\mathrm{opt}}^2$: The variance of the stochastic gradients at the optimum for identical data. See Assumption 3.
$\sigma_{\mathrm{dif}}^2$: The variance of the stochastic gradients at the optimum for heterogeneous data. See Assumption 3.
$t_p$: Timesteps at which synchronization happens in Algorithm 1.
$H$: Upper bound on the maximum number of local computations between synchronization timesteps, i.e. $\max_p (t_{p+1} - t_p) \le H$.
Table 3: Common Notation Summary.

Throughout the proofs, we will use the variance decomposition that holds for any random vector $X$ with finite second moment:

$\mathbb{E} \left\| X \right\|^2 = \mathbb{E} \left\| X - \mathbb{E} X \right\|^2 + \left\| \mathbb{E} X \right\|^2.$ (6)

In particular, its version for a vector with a finite number of values $x_1, \dots, x_n$ gives

$\frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \bar{x} \right\|^2 = \frac{1}{n} \sum_{i=1}^{n} \left\| x_i \right\|^2 - \left\| \bar{x} \right\|^2, \qquad \text{where } \bar{x} := \frac{1}{n} \sum_{i=1}^{n} x_i.$ (7)

As a consequence of (6) we have that

$\mathbb{E} \left\| X - \mathbb{E} X \right\|^2 \le \mathbb{E} \left\| X \right\|^2.$ (8)
Proposition 1 (Jensen’s inequality).

For any convex function $h$ and any vectors $x_1, \dots, x_n$ we have

$h\!\left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) \le \frac{1}{n} \sum_{i=1}^{n} h(x_i).$ (9)

As a special case with $h(x) = \| x \|^2$, we obtain

$\left\| \frac{1}{n} \sum_{i=1}^{n} x_i \right\|^2 \le \frac{1}{n} \sum_{i=1}^{n} \left\| x_i \right\|^2.$ (10)

We denote the Bregman divergence associated with a function $h$ and arbitrary points $x, y$ as

$D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle.$

Proposition 2.

If $h$ is $L$-smooth and convex, then for any $x$ and $y$ it holds that

$\left\| \nabla h(x) - \nabla h(y) \right\|^2 \le 2 L\, D_h(x, y).$ (11)

If satisfies Assumption 1, then

(12)

We will also use the following facts from linear algebra:

(13)
(14)

7 Proofs for Identical data under Assumption 2

7.1 Proof of Lemma 1

Proof.

Let be such that . Recall that for a time such that we have and . Hence for the expectation conditional on we have:

Averaging both sides and letting , we have

(15)

Now note that by expanding the square we have,

(16)

We decompose the first term in the last equality again by expanding the square,

Plugging this into (16) we have,

Now average over :

where we used that by definition . Hence,

(17)

Now note that for the first term in (17) we have by Assumption 2,

(18)

For the second term in (17) we have

Averaging over ,

where we used the fact that , which comes from the linearity of expectation. Now we bound in the last inequality by smoothness and then use that Jensen’s inequality implies ,

(19)

Plugging in (19) and (18) into (17) we have,

(20)

Plugging (20) into (15), we get

(21)

Using that we can conclude,

Taking expectations and iterating the above inequality,

It remains to notice that by assumption we have . ∎

7.2 Two More Lemmas

Lemma 2.

[Stich, 2019]. Let be iterates generated by Algorithm 1 run with identical data. Suppose that satisfies Assumption 1 and that . Then,

(22)
Proof.

This is Lemma 3.1 in [Stich, 2019]. ∎