Asynchronous Federated Optimization

Cong Xie, Oluwasanmi Koyejo, Indranil Gupta
Department of Computer Science, University of Illinois Urbana-Champaign, Illinois, USA
Correspondence: Cong Xie <cx2@illinois.edu>, Oluwasanmi Koyejo <sanmi@illinois.edu>, Indranil Gupta <indy@illinois.edu>

Keywords: Machine Learning, SysML

Abstract

Federated learning enables training on a massive number of edge devices. To improve flexibility and scalability, we propose a new asynchronous federated optimization algorithm. We prove that the proposed approach has near-linear convergence to a global optimum, for both strongly and non-strongly convex problems, as well as a restricted family of non-convex problems. Empirical results show that the proposed algorithm converges fast and tolerates staleness.

1 Introduction

Federated learning (Konečný et al., 2016; McMahan et al., 2016) enables training a global model on datasets that are decentrally located across a massive number of resource-weak edge devices. Federated learning is motivated by the massive amount of data generated in our daily life by edge/IoT devices such as smartphones, wearables, sensors, and smart homes/buildings. Ideally, larger amounts of training data from diverse users result in improved representation and generalization of machine-learning models. Federated learning is also motivated by the need for privacy preservation: in some scenarios, on-device training without depositing data in the cloud is legally required by regulations such as the US HIPAA laws (HealthInsurance.org, 1996) and Europe's GDPR (EU, 2018).

Typically, a federated learning system is composed of servers and workers, with an architecture similar to that of parameter servers (Li et al., 2014a, b; Ho et al., 2013). The workers train the models locally on the private data residing on the edge devices. The servers aggregate the learned models from the workers and produce a global model in the cloud/datacenter. To protect the users' privacy, the workers never expose the training data to the servers; they expose only the trained models.

We summarize the key properties of federated learning below:

  • Infrequent task scheduling. Edge devices typically have weak computational capacity and limited battery time. Unlike in traditional distributed machine learning, on-device federated learning tasks may only be executed when the device is idle, charging, and connected to an unmetered network such as WiFi (Bonawitz et al., 2019). Edge devices ping the servers when they are ready to execute training tasks, and the servers then schedule training tasks on the available devices. Furthermore, to avoid congesting the network, the server randomizes the check-in times of the workers. As a result, on each edge device, the training task is executed only infrequently.

  • Limited communication. The connection between edge devices and the remote servers may be frequently unavailable, slow, or expensive (in terms of communication cost or battery power). Thus, compared to typical distributed optimization, communication in federated learning needs to be much less frequent.

  • Non-IID training data. Unlike in traditional distributed machine learning, the data on different devices are not shuffled together and are not IID; they represent non-identically distributed samples from the population.

We posit that the synchronous flavor of federated optimization is potentially unscalable, inefficient, and inflexible. Previous synchronous training algorithms for federated averaging (McMahan et al., 2016; Bonawitz et al., 2019) can only handle hundreds of devices in parallel, while there are nearly 4 billion mobile phones in total (eMarketer, 2019). Even at smaller scales, such as a stadium during a game, thousands of devices may be involved. Too many devices checking in at the same time can congest the network on the server side, so in each global epoch the server can select only a subset of the available devices to trigger training tasks. Furthermore, since task scheduling varies from device to device due to limited computational capacity and battery time, it is difficult to synchronize the selected devices at the end of each epoch: some devices will no longer be available before synchronization. Instead, the server has to set a timeout threshold and drop the stragglers. If the number of surviving devices is too small, the server may have to drop the entire epoch, including all the received updates.

To address these issues that arise in synchronous federated optimization, we propose a novel asynchronous federated optimization algorithm. The key idea is to use a weighted average to update the global model. The mixing weight can also be set adaptively as a function of the staleness. We show that taken together, these changes result in an effective asynchronous federated optimization procedure.

The main contributions of our paper are listed as follows:

  • We propose a new asynchronous federated optimization algorithm with provable convergence under non-IID settings.

  • We show that the proposed approach has near-linear convergence to a global optimum, for both strongly and non-strongly convex problems, as well as a restricted family of non-convex problems.

  • We propose strategies for controlling the error caused by asynchrony. We introduce a mixing hyperparameter that adaptively controls the trade-off between the convergence rate and variance reduction according to the staleness.

  • We show empirically that the proposed algorithm converges fast and outperforms synchronous federated optimization.

2 Related Work

Edge computing (Garcia Lopez et al., 2015; Hong et al., 2013) is increasingly applied in scenarios such as smart homes, wearable devices, and sensor networks. Meanwhile, machine-learning applications are also moving from the cloud to the edge (Cao et al., 2015; Mahdavinejad et al., 2018; Zeydan et al., 2016). Edge devices typically have weaker computation and communication capacity than workstations and datacenters, due to weaker hardware, limited battery time, and metered networks. As a result, lightweight machine-learning models such as MobileNet (Howard et al., 2017) have been proposed for learning tasks on weak devices.

Existing federated optimization methods (Konečný et al., 2015, 2016; McMahan et al., 2016; Bonawitz et al., 2019) focus on synchronous training. In each global epoch, training tasks are triggered on a subset of workers; however, due to bad network conditions or other transient issues, some workers may fail to respond. When this happens, the server waits until a sufficient number of workers respond; otherwise, it times out, drops the current epoch, and moves on to the next one. As far as we know, this paper is the first to study asynchronous training in federated learning with provable convergence.

Asynchronous training (Zinkevich et al., 2009; Lian et al., 2017; Zheng et al., 2017) is widely used in traditional distributed SGD. Typically, asynchronous SGD converges faster than synchronous SGD, especially when the communication latency is high and heterogeneous. However, classic asynchronous SGD sends gradients directly to the servers after each local update, which is not feasible for edge devices with unreliable and slow communication. In this paper, we take advantage of asynchronous training and combine it with federated optimization.

Notation/Term: Description
$n$: Number of devices
$T$: Number of global epochs
$[n]$: Set of integers $\{1, \ldots, n\}$
$H_{\min}$: Minimal number of local iterations
$H_\tau^i$: Number of local iterations in the $\tau$th epoch on the $i$th device
$x_t$: Global model in the $t$th epoch on the server
$x_{\tau,h}^i$: Model initialized from $x_\tau$, updated in the $h$th local iteration, on the $i$th device
$\mathcal{D}^i$: Dataset on the $i$th device
$z_{\tau,h}^i$: Data (minibatch) sampled from $\mathcal{D}^i$
$\eta$: Learning rate
$\alpha$: Mixing hyperparameter
$\rho$: Regularization weight
$t - \tau$: Staleness
$s(t - \tau)$: Function of staleness for adaptive $\alpha$
$\|\cdot\|$: All the norms in this paper are $\ell_2$-norms
Device: Where the training data are placed
Worker: One worker on each device; the process that trains the model
Table 1: Notations and Terminologies.

3 Problem Formulation

We consider federated learning with $n$ devices. On each device, a worker process trains a model on the local data. The overall goal is to train a global model $x \in \mathbb{R}^d$ using data from all the devices.

To do so, we consider the following optimization problem:

$$\min_{x \in \mathbb{R}^d} F(x), \quad \text{where} \quad F(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i),$$

and $z^i$, for $i \in [n]$, is sampled from the local data $\mathcal{D}^i$ on the $i$th device.

Note that different devices have different local datasets, i.e., $\mathcal{D}^i \neq \mathcal{D}^j$ for $i \neq j$. Thus, samples drawn from different devices may have different expectations, i.e., in general, $\mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i) \neq \mathbb{E}_{z^j \sim \mathcal{D}^j} f(x; z^j)$.

4 Methodology

A single execution of federated optimization has $T$ global epochs. In the $t$th epoch, the server receives a locally trained model $x_{new}$ from an arbitrary worker, and updates the global model by weighted averaging:

$$x_t = (1 - \alpha_t) x_{t-1} + \alpha_t x_{new},$$

where $\alpha_t \in (0, 1)$ is the mixing hyperparameter.

On an arbitrary device $i$, after receiving a (potentially stale) global model $x_\tau$ from the server, we locally solve the following regularized optimization problem using SGD for an arbitrary number of iterations:

$$\min_{x \in \mathbb{R}^d} \mathbb{E}_{z \sim \mathcal{D}^i} f(x; z) + \frac{\rho}{2} \|x - x_\tau\|^2.$$

The server and workers conduct updates asynchronously. The server immediately updates the global model whenever it receives a local model. The communication between the server and workers is non-blocking.

The detailed algorithm is shown in Algorithm 1. $x_{\tau,h}^i$ is the model parameter updated in the $h$th local iteration after receiving $x_\tau$, on the $i$th device. $z_{\tau,h}^i$ is the data randomly drawn in the $h$th local iteration after receiving $x_\tau$, on the $i$th device. $H_\tau^i$ is the number of local iterations after receiving $x_\tau$, on the $i$th device. $\eta$ is the learning rate and $T$ is the total number of global epochs.

Remark 1.

On the server side, two threads run asynchronously in parallel: the scheduler and the updater. The scheduler periodically triggers training tasks on some workers. The updater receives locally trained models from workers and updates the global model. There can be multiple updater threads with a read-write lock on the global model, which improves the throughput. The scheduler randomizes the timing of the training tasks to avoid overloading the updater thread, and controls the staleness ($t - \tau$ in the updater thread). We illustrate the overview of the system in Figure 1.

Figure 1: System overview. 0⃝: the scheduler triggers training tasks through the coordinator. 1⃝, 2⃝: a worker receives a (possibly delayed) global model from the server. 3⃝: the worker performs local updates as described in Algorithm 1; the worker process switches between two states, working and idle, according to the device's availability. 4⃝, 5⃝, 6⃝: the worker pushes the locally updated model to the server via the coordinator; the scheduler queues the models received in 5⃝ and feeds them to the updater sequentially in 6⃝. 7⃝, 8⃝: the server updates the global model and makes it available for reading in the coordinator. In our system, 1⃝ and 5⃝ operate asynchronously in parallel, so that the server can trigger training tasks on the devices at any time, and the devices can push locally updated models to the server at any time.
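To make the two-thread structure of Remark 1 concrete, below is a minimal sketch (our construction, not the authors' implementation) of the scheduler/updater pattern, assuming a mutex in place of the read-write lock and a queue of incoming (model, time stamp) pairs; FedAsyncServer, snapshot, and updater_loop are illustrative names:

  import queue
  import threading

  class FedAsyncServer:
      def __init__(self, x0, alpha, s):
          self.x = x0                    # global model (e.g., a numpy array)
          self.t = 0                     # global epoch / time stamp
          self.alpha = alpha             # base mixing hyperparameter
          self.s = s                     # staleness weighting function s(t - tau)
          self.lock = threading.Lock()   # read-write lock simplified to a mutex
          self.updates = queue.Queue()   # (x_new, tau) pairs pushed by workers

      def snapshot(self):
          # Scheduler side: hand out the latest global model and its time stamp.
          with self.lock:
              return self.x.copy(), self.t

      def updater_loop(self, num_epochs):
          # Updater side: weighted-average each received model into the global one.
          for _ in range(num_epochs):
              x_new, tau = self.updates.get()  # blocks until a worker pushes
              with self.lock:
                  alpha_t = self.alpha * self.s(self.t - tau)  # optional adaptivity
                  self.x = (1 - alpha_t) * self.x + alpha_t * x_new
                  self.t += 1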

Intuitively, larger staleness results in larger error when updating the global model. For local models with large staleness $t - \tau$, we decrease $\alpha$ to mitigate the error caused by staleness. As shown in Algorithm 1, we can optionally use a function $s(t - \tau)$ to determine the value of $\alpha_t$. We list some choices of $s(t - \tau)$, parameterized by $a, b > 0$:

  • Linear: $s_a(t - \tau) = \frac{1}{a(t - \tau) + 1}$.

  • Polynomial: $s_a(t - \tau) = (t - \tau + 1)^{-a}$.

  • Exponential: $s_a(t - \tau) = \exp(-a(t - \tau))$.

  • Hinge: $s_{a,b}(t - \tau) = 1$ if $t - \tau \le b$, and $\frac{1}{a(t - \tau - b) + 1}$ otherwise.
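For concreteness, the four weighting strategies can be written as follows (a small sketch, assuming the parameterizations above, with hyperparameters a > 0 and b ≥ 0):

  import math

  # Each function maps staleness 0 to weight 1 and decays as staleness grows.
  def s_linear(staleness, a):
      return 1.0 / (a * staleness + 1.0)

  def s_polynomial(staleness, a):
      return (staleness + 1.0) ** (-a)

  def s_exponential(staleness, a):
      return math.exp(-a * staleness)

  def s_hinge(staleness, a, b):
      return 1.0 if staleness <= b else 1.0 / (a * (staleness - b) + 1.0)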

  Server Process
  Input: $\alpha \in (0, 1)$
  Initialize $x_0$, $\alpha_t \leftarrow \alpha$
  Scheduler Thread
  Scheduler periodically triggers some training tasks on some workers, and sends them the latest global model $x_t$ with time stamp $t$
  Updater Thread
  for all epoch $t \in [T]$ do
     Receive the pair $(x_{new}, \tau)$ from any worker
     Optional: $\alpha_t \leftarrow \alpha \times s(t - \tau)$, where $s(\cdot)$ is a function of the staleness
     $x_t \leftarrow (1 - \alpha_t) x_{t-1} + \alpha_t x_{new}$
  end for
  Worker Processes
  for all $i \in [n]$ in parallel do
     If triggered by the scheduler:
     Receive the pair $(x_t, t)$ of the global model and its time stamp from the server
     $\tau \leftarrow t$, $x_{\tau,0}^i \leftarrow x_t$
     For $\mu$-weakly convex $f$:
      Define $g_{x_\tau}(x; z) := f(x; z) + \frac{\rho}{2} \|x - x_\tau\|^2$, where $\rho > \mu$
     for all local iteration $h \in [H_\tau^i]$ do
        Randomly sample $z_{\tau,h}^i \sim \mathcal{D}^i$
        $x_{\tau,h}^i \leftarrow x_{\tau,h-1}^i - \eta \nabla g_{x_\tau}(x_{\tau,h-1}^i; z_{\tau,h}^i)$
     end for
     Push $(x_{\tau,H_\tau^i}^i, \tau)$ to the server
  end for
Algorithm 1 Asynchronous Federated Optimization (FedAsync)
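To complement the pseudocode, here is a compact sketch of the two sides of Algorithm 1 (ours; a toy least-squares gradient stands in for the gradient of $f$, and grad_f, local_update, and server_update are illustrative names):

  import numpy as np

  rng = np.random.default_rng(0)

  def grad_f(x, z):
      # Stand-in stochastic gradient of f(x; z); here a least-squares example.
      return x - z

  def local_update(x_tau, data, eta=0.01, rho=0.005, num_iters=10):
      # Worker side: SGD on g(x; z) = f(x; z) + (rho/2) * ||x - x_tau||^2.
      x = x_tau.copy()
      for _ in range(num_iters):
          z = data[rng.integers(len(data))]              # randomly sample data
          x -= eta * (grad_f(x, z) + rho * (x - x_tau))  # gradient of g
      return x

  def server_update(x_global, x_new, staleness, alpha=0.6, s=lambda d: 1.0 / (d + 1.0)):
      # Server side: staleness-weighted averaging, alpha_t = alpha * s(t - tau).
      alpha_t = alpha * s(staleness)
      return (1 - alpha_t) * x_global + alpha_t * x_new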

5 Convergence Analysis

In this section, we prove the convergence of Algorithm 1 with non-IID data.

5.1 Assumptions

First, we introduce some definitions and assumptions for our convergence analysis.

Definition 1.

(Smoothness) A differentiable function $f$ is $L$-smooth if for $\forall x, y$,

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2,$$

where $L > 0$.

Definition 2.

(Strong convexity) A differentiable function $f$ is $\mu$-strongly convex if for $\forall x, y$,

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2,$$

where $\mu \ge 0$. Note that if $\mu = 0$, $f$ is convex.

Definition 3.

(Weak convexity) A differentiable function $f$ is $\mu$-weakly convex if the function $g$ with $g(x) = f(x) + \frac{\mu}{2} \|x\|^2$ is convex, where $\mu \ge 0$.

Remark 2.

Note that when $f$ is $\mu$-weakly convex, then $f$ is convex if $\mu = 0$, and potentially non-convex if $\mu > 0$.
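As a one-line illustration (our example), take $f(x) = -\frac{\mu}{2}\|x\|^2$ with $\mu > 0$: the function is concave, hence non-convex, yet it is $\mu$-weakly convex, since

$$f(x) + \frac{\mu}{2}\|x\|^2 = 0,$$

and the zero function is convex.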

Assumption 1.

(Existence of global optimum) We assume that there exists a set $X^* \subseteq \mathbb{R}^d$, where any element $x^* \in X^*$ is a global optimum of $F(x)$, with $F(x^*) = F^*$ and $\nabla F(x^*) = 0$.

5.2 Convergence Guarantees

Based on the assumptions above, we have the following convergence guarantees. Detailed proofs can be found in the appendix.

Theorem 1.

Assume that the global loss function $F$ is $L$-smooth and $\mu$-strongly convex, and that each worker executes at least $H_{\min}$ local updates before pushing models to the server. Furthermore, assume that for $\forall x \in \mathbb{R}^d$, $i \in [n]$, and $z \sim \mathcal{D}^i$, the stochastic gradients are appropriately bounded. Then, with a suitable choice of the learning rate, after $T$ global updates/epochs on the server, Algorithm 1 with Option I converges to a global optimum $x^*$ at a near-linear rate, up to an additive error term caused by the variance.

Remark 3.

The mixing hyperparameter $\alpha$ controls the trade-off between the convergence rate and the additional error caused by the variance. Larger $\alpha$ yields faster convergence at the price of a larger additional error; smaller $\alpha$ reduces the variance at the price of slower convergence. In practice, to balance the convergence rate and the variance reduction, we use a diminishing $\alpha_t$, so that the algorithm converges fast at the beginning and reduces the variance at the end.

Theorem 2.

Assume that the global loss function $F$ is $L$-smooth and $\mu$-weakly convex (potentially non-convex), and that each worker executes at least $H_{\min}$ local updates before pushing models to the server. Furthermore, assume that for $\forall x \in \mathbb{R}^d$, $i \in [n]$, and $z \sim \mathcal{D}^i$, the stochastic gradients are appropriately bounded. Then, with suitable choices of the learning rate and the regularization weight $\rho > \mu$, after $T$ global updates/epochs on the server, Algorithm 1 with Option II converges to a global optimum $x^*$, up to an additive error term caused by the variance.

  Input: $k$, $T$, $\eta$, $H$
  Initialize $x_0$
  for all epoch $t \in [T]$ do
     Randomly select a group of $k$ workers, denoted as $S_t$
     for all $i \in S_t$ in parallel do
        Receive the latest global model $x_{t-1}$ from the server
        $x_{t,0}^i \leftarrow x_{t-1}$
        for all local iteration $h \in [H]$ do
           Randomly sample $z_{t,h}^i \sim \mathcal{D}^i$
           $x_{t,h}^i \leftarrow x_{t,h-1}^i - \eta \nabla f(x_{t,h-1}^i; z_{t,h}^i)$
        end for
        Push $x_{t,H}^i$ to the server
     end for
     Update the global model: $x_t \leftarrow \frac{1}{k} \sum_{i \in S_t} x_{t,H}^i$
  end for
Algorithm 2 Federated Averaging (FedAvg)
  Initialize $x_0$
  for all iteration $t \in [T]$ do
     Randomly sample $z_t$
     $x_t \leftarrow x_{t-1} - \eta \nabla f(x_{t-1}; z_t)$
  end for
Algorithm 3 SGD (Single Thread)
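For side-by-side comparison with the FedAsync sketch in Section 4, here is a minimal rendering of one FedAvg round from Algorithm 2 (ours; sequential rather than parallel, with the same toy least-squares gradient standing in for the real one):

  import numpy as np

  rng = np.random.default_rng(0)

  def fedavg_round(x_global, local_datasets, k=10, eta=0.01, num_iters=10):
      # One synchronous round: k selected workers train locally,
      # then the server averages the returned models with equal weights.
      selected = rng.choice(len(local_datasets), size=k, replace=False)
      local_models = []
      for i in selected:
          x = x_global.copy()
          for _ in range(num_iters):
              data = local_datasets[i]
              z = data[rng.integers(len(data))]  # randomly sample data
              x -= eta * (x - z)                 # stand-in stochastic gradient step
          local_models.append(x)
      return np.mean(local_models, axis=0)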
Figure 2: Top-1 accuracy on the testing set and cross entropy on the training set vs. # of gradients.
Figure 3: Top-1 accuracy on the testing set and cross entropy on the training set vs. # of gradients.
Figure 4: Top-1 accuracy on the testing set and cross entropy on the training set vs. # of global epochs.
Figure 5: Top-1 accuracy on the testing set and cross entropy on the training set vs. # of global epochs.
Figure 6: Top-1 accuracy on the testing set and cross entropy on the training set vs. # of communications.
Figure 7: Top-1 accuracy on the testing set and cross entropy on the training set vs. # of communications.
Figure 8: Top-1 accuracy on the testing set and cross entropy on the training set at the end of training (at the 2000th epoch), with different staleness.
Figure 9: Top-1 accuracy on the testing set and cross entropy on the training set at the end of training (at the 2000th epoch), with different $\alpha$.
Figure 10: Top-1 accuracy on the testing set and cross entropy on the training set at the end of training (at the 2000th epoch), with different $\alpha$.

6 Experiments

In this section, we empirically evaluate the proposed algorithm.

6.1 Datasets

We conduct experiments on the benchmark CIFAR-10 image classification dataset (Krizhevsky & Hinton, 2009), which is composed of 50k images for training and 10k images for testing. We resize and crop each image. We use a convolutional neural network (CNN) with 4 convolutional layers followed by 1 fully connected layer. We chose a simple network architecture so that it can be easily handled by mobile devices. The detailed network architecture can be found in the appendix. In each experiment, the training set is partitioned evenly onto the $n$ devices. For any worker, SGD uses a fixed minibatch size.
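As a plausible instantiation of such a model (ours; the filter counts, kernel sizes, and the 24x24 input crop are assumptions, not the exact architecture from the appendix):

  import torch.nn as nn

  class SmallCNN(nn.Module):
      # 4 convolutional layers followed by 1 fully connected layer.
      def __init__(self, num_classes=10):
          super().__init__()
          self.features = nn.Sequential(
              nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
              nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
              nn.MaxPool2d(2),  # 24x24 -> 12x12
              nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
              nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
              nn.MaxPool2d(2),  # 12x12 -> 6x6
          )
          self.classifier = nn.Linear(64 * 6 * 6, num_classes)

      def forward(self, x):
          return self.classifier(self.features(x).flatten(1))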

6.2 Evaluation Specifics

The baseline algorithm is FedAvg introduced by McMahan et al. (2016), which is synchronous federated optimization; the details are shown in Algorithm 2. For FedAvg, in each epoch, $k$ devices are randomly selected to launch local updates. We also use single-thread SGD as a baseline; the details are shown in Algorithm 3. For the two baseline algorithms, we use grid search to tune the learning rates and report the best results according to the top-1 accuracy on the testing set.

We repeat each experiment 10 times and take the average. We use top-1 accuracy on the testing set, and cross entropy loss function on the training set as the evaluation metrics.

For convenience, we refer to Algorithm 1 as FedAsync. We also test the performance of FedAsync with the adaptive mixing hyperparameter $\alpha_t$, as mentioned in Section 4. We employ the following two strategies:

  • Polynomial: $s_a(t - \tau) = (t - \tau + 1)^{-a}$.

  • Hinge: $s_{a,b}(t - \tau) = 1$ if $t - \tau \le b$, and $\frac{1}{a(t - \tau - b) + 1}$ otherwise.

For convenience, we refer to FedAsync with polynomial adaptive $\alpha_t$ as FedAsync+Poly, and FedAsync with hinge adaptive $\alpha_t$ as FedAsync+Hinge.

To compare asynchronous training and synchronous training, we conduct three comparisons: metrics vs. number of global epochs, metrics vs. number of gradients, and metrics vs. number of communications:

  • The number of global epochs counts how many times the global model is updated. The total number of global epochs is $T$ in both Algorithm 1 and Algorithm 2. Single-thread SGD has no global epochs, so we ignore it in the experiments of metrics vs. # of global epochs.

  • The number of gradients counts the gradients applied to the global model. Note that for both Algorithm 1 and Algorithm 2, an epoch of local iterations is a full pass of the local dataset. Thus, for FedAsync, one worker's full local pass of gradients is applied to the global model in each global epoch; for FedAvg, since $k$ workers contribute per epoch, $k$ times as many gradients are applied in each global epoch.

  • The number of communications measures the communication overhead on the server side. We count how many times models are exchanged (sent/received) on the server. On average, in each global epoch, FedAvg incurs $k$ times the communications of FedAsync. Single-thread SGD has no communication, so we ignore it.

In all the experiments, we simulate the asynchrony by randomly sampling the staleness from a uniform distribution.
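Concretely, the simulation can be sketched as follows (ours; a scalar model and a toy local step stand in for the real training):

  import random

  def simulate(num_epochs=2000, max_staleness=16, alpha=0.6, eta=0.1):
      x = 0.0              # scalar "global model" for illustration
      history = [x]        # snapshots x_0, x_1, ..., x_t
      for t in range(1, num_epochs + 1):
          staleness = random.randint(0, min(max_staleness, t - 1))
          x_old = history[t - 1 - staleness]       # worker started from a stale model
          x_new = x_old - eta * (x_old - 1.0)      # stand-in local update
          alpha_t = alpha / (staleness + 1.0)      # e.g., linear weighting
          x = (1 - alpha_t) * x + alpha_t * x_new  # server weighted average
          history.append(x)
      return x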

6.3 Empirical Results

We test the algorithms with different initial learning rates $\eta$, regularization weights $\rho$, mixing hyperparameters $\alpha$, and staleness. $\eta$ decays by a constant factor at a fixed global epoch.

In Figures 2 and 3, we show how FedAsync converges as the number of gradients grows. We can see that when the overall staleness is small, FedAsync converges as fast as SGD and faster than FedAvg. When the staleness is larger, FedAsync converges more slowly; in the worst case, its convergence rate is similar to FedAvg's. When $\alpha$ is too large, the convergence can be unstable, but with adaptive $\alpha_t$ the convergence is robust to large $\alpha$. Note that when the maximum staleness does not exceed the hinge threshold $b$, FedAsync and FedAsync+Hinge are the same.

In Figures 4 and 5, we show how FedAsync converges as the number of global epochs grows. FedAvg naturally makes more progress in each epoch; however, FedAvg has to wait until all the selected workers respond in each epoch, while FedAsync only needs one worker's response to move on to the next epoch.

In Figures 6 and 7, we show how FedAsync converges as the communication overhead grows. With the same amount of communication overhead, FedAsync converges faster than FedAvg when the staleness is small; when the staleness is large, FedAsync performs similarly to FedAvg.

In Figure 8, we show how staleness affects the convergence of FedAsync. Overall, larger staleness makes the convergence slower, but the influence is not catastrophic. Furthermore, using adaptive mixing hyperparameters, the instability caused by large staleness can be mitigated.

In Figures 9 and 10, we show how $\alpha$ affects the convergence of FedAsync. In general, FedAsync is robust to different $\alpha$; the differences are so small that we have to zoom in. When the staleness is small, the adaptive mixing hyperparameter is less necessary. When the staleness is large, smaller $\alpha$ is better for FedAsync, while larger $\alpha$ is better for FedAsync+Poly and FedAsync+Hinge. That is because the adaptive $\alpha_t$ is automatically adjusted to be smaller when the staleness is large, so we should not also manually decrease $\alpha$.

6.4 Discussion

In general, the convergence rate of FedAsync falls between that of single-thread SGD and FedAvg. Larger $\alpha$ and smaller staleness make FedAsync behave more like single-thread SGD; smaller $\alpha$ and larger staleness make it behave more like FedAvg.

FedAsync is generally insensitive to hyperparameters. When the staleness is large, we can tune $\alpha$ to improve the convergence. Without adaptive $\alpha_t$, smaller $\alpha$ is better for larger staleness; with adaptivity, the best choice in our experiments is FedAsync+Poly.

Larger staleness makes the convergence slower and less stable. There are three ways to control the influence of staleness:

  • On the server side, the updater thread can drop updates with large staleness $t - \tau$. This can be viewed as a special case of the adaptive mixing hyperparameter $\alpha_t$: when the staleness is too large, we simply take $\alpha_t = 0$ (see the sketch after this list).

  • More generally, using adaptive mixing hyperparameters improves the convergence, as shown in Figure 8. Different strategies yield different improvements; so far, we find that FedAsync+Poly performs best.

  • On the server side, the scheduler thread can control the assignment of training tasks to the workers. If the on-device training is triggered less frequently, the overall staleness will be smaller.
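As referenced in the first item above, dropping overly stale updates can be folded into the adaptive mixing weight (a sketch of ours; the hinge parameters a and b and the drop threshold are illustrative values, not from the paper):

  def adaptive_alpha(alpha, staleness, a=10.0, b=4, drop_threshold=16):
      # alpha_t = 0 drops the update entirely; otherwise hinge weighting.
      if staleness > drop_threshold:
          return 0.0
      if staleness <= b:
          return alpha
      return alpha / (a * (staleness - b) + 1.0)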

Systematically, FedAsync has the following advantages compared to FedAvg:

  • Efficiency: The server can receive updates from the workers at any time; unlike in FedAvg, stragglers' updates are not dropped. When the staleness is small, FedAsync converges much faster than FedAvg; in the worst case, when the staleness is large, FedAsync still performs similarly to FedAvg.

  • Flexibility: If some workers are no longer eligible for the training tasks (the devices are no longer idle, charging, or connected to unmetered networks), they can temporarily save the workspace, and continue the training or push the trained model to the server later. This also gives more flexibility to the scheduler on the server. Unlike FedAvg, FedAsync can schedule training tasks even if the workers are currently ineligible, since the server does not wait until the workers respond. The currently ineligible workers can start the training tasks later.

  • Scalability: Compared to FedAvg, FedAsync can handle more workers running in parallel since all the updates on the server and the workers are non-blocking. The server only needs to randomize the responding time of the workers to avoid congesting the network.

7 Conclusion

We proposed a novel asynchronous federated optimization algorithm on non-IID training data. The algorithm has near-linear convergence to a global optimum, for both strongly and non-strongly convex problems, as well as a restricted family of non-convex problems. For future work, we plan to investigate the design of strategies to adaptively tune the mixing hyperparameters.

References

Appendix

8 Proofs

Theorem 1.

Assume that the global loss function $F$ is $L$-smooth and $\mu$-strongly convex, and that each worker executes at least $H_{\min}$ local updates before pushing models to the server. Furthermore, assume that for $\forall x \in \mathbb{R}^d$, $i \in [n]$, and $z \sim \mathcal{D}^i$, the stochastic gradients are appropriately bounded. Then, with a suitable choice of the learning rate, after $T$ global updates on the server, Algorithm 1 with Option I converges to a global optimum $x^*$ at a near-linear rate, up to an additive error term caused by the variance.

Proof.

Without loss of generality, assume that in the $t$th epoch the server receives the model $x_{new}$ with time stamp $\tau$, and that $x_{new}$ is the result of applying $H_\tau^i$ local updates to $x_\tau$ on the $i$th device. For convenience, we drop the device index $i$ in $x_{\tau,h}^i$ and $z_{\tau,h}^i$.

Thus, using smoothness and strong convexity, conditional on $x_\tau$, we can bound the progress of each local update. By telescoping and taking total expectation over the $H_\tau^i$ local updates, we obtain a bound on the improvement of $x_{new}$ over $x_\tau$.

On the server side, we have $x_t = (1 - \alpha_t) x_{t-1} + \alpha_t x_{new}$. Thus, by convexity, conditional on $x_{t-1}$ and $x_{new}$, we can bound the expected progress of each global update. By telescoping and taking total expectation over the $T$ global updates on the server, the claimed bound follows.

Theorem 2.

Assume that the global loss function $F$ is $L$-smooth and $\mu$-weakly convex (potentially non-convex), and that each worker executes at least $H_{\min}$ local updates before pushing models to the server. Furthermore, assume that for $\forall x \in \mathbb{R}^d$, $i \in [n]$, and $z \sim \mathcal{D}^i$, the stochastic gradients are appropriately bounded. Then, with suitable choices of the learning rate and the regularization weight $\rho > \mu$, after $T$ global updates on the server, Algorithm 1 with Option II converges to a global optimum $x^*$, up to an additive error term caused by the variance.

Proof.

Without loss of generality, assume that in the $t$th epoch the server receives the model $x_{new}$ with time stamp $\tau$, and that $x_{new}$ is the result of applying $H_\tau^i$ local updates to $x_\tau$ on the $i$th device. For convenience, we drop the device index $i$ in $x_{\tau,h}^i$ and $z_{\tau,h}^i$.

Thus, using smoothness and the strong convexity of the regularized objective, conditional on $x_\tau$, we can bound the progress of each local update. Note that for $\rho > \mu$, the function $g_{x_\tau}(x; z) = f(x; z) + \frac{\rho}{2}\|x - x_\tau\|^2$ is $(\rho - \mu)$-strongly convex, since $f$ is $\mu$-weakly convex.

By telescoping and taking total expectation over the $H_\tau^i$ local updates, we obtain a bound on the improvement of $x_{new}$ over $x_\tau$. On the server side, we have $x_t = (1 - \alpha_t) x_{t-1} + \alpha_t x_{new}$. Thus, by convexity, conditional on $x_{t-1}$ and $x_{new}$, the claimed bound follows by telescoping over the $T$ global updates.