Abstract
Federated learning enables training on a massive number of edge devices. To improve flexibility and scalability, we propose a new asynchronous federated optimization algorithm. We prove that the proposed approach has near-linear convergence to a global optimum, for both strongly and non-strongly convex problems, as well as a restricted family of nonconvex problems. Empirical results show that the proposed algorithm converges fast and tolerates staleness.
Asynchronous Federated Optimization
Cong Xie, Oluwasanmi Koyejo, Indranil Gupta
Department of Computer Science, University of Illinois Urbana-Champaign, Illinois, USA
Correspondence: Cong Xie (cx2@illinois.edu), Oluwasanmi Koyejo (sanmi@illinois.edu), Indranil Gupta (indy@illinois.edu)
Machine Learning, SysML
1 Introduction
Federated learning (Konečný et al., 2016; McMahan et al., 2016) enables training a global model on datasets that are distributed across a massive number of resource-weak edge devices. Federated learning is motivated by the massive data generated in our daily life by edge devices/IoT such as smart phones, wearable devices, sensors, and smart homes/buildings. Ideally, larger amounts of training data from diverse users result in improved representation and generalization of machine-learning models. Federated learning is also motivated by the need for privacy preservation. In some scenarios, on-device training without depositing data in the cloud is legally required by regulations such as US HIPAA laws (HealthInsurance.org, 1996) and Europe’s GDPR law (EU, 2018).
Typically, a federated learning system is composed of servers and workers, whose architecture is similar to parameter servers (Li et al., 2014a, b; Ho et al., 2013). The workers train the models locally on the private data on edge devices. The servers aggregate the learned models from the workers, and produce a global model on the cloud/datacenter. To protect the users’ privacy, the workers do not expose the training data to the servers, and instead only expose the trained model.
We summarize the key properties of federated learning below:

Infrequent task scheduling. Edge devices typically have weak computational capacity and limited battery time. Unlike in traditional distributed machine learning, on-device federated learning tasks are allowed to be executed only when the device is idle, charging, and connected to unmetered networks (i.e., Wi-Fi) (Bonawitz et al., 2019). The edge devices will ping the servers when they are ready to execute training tasks. The servers will then schedule training tasks on available edge devices. Furthermore, to avoid congesting the network, the server randomizes the check-in time of the workers. As a result, on each edge device, the training task is executed infrequently.

Limited communication. The connection between edge devices and the remote servers may be frequently unavailable, slow, or expensive (in terms of communication costs or battery power). Thus, compared to typical distributed optimization, communication in federated learning needs to be much less frequent.

Non-IID training data. Unlike in traditional distributed machine learning, the data on different devices are not mixed and IID, and thus represent non-identically distributed samples from the population.
We posit that the synchronous flavor of federated optimization is potentially unscalable, inefficient, and inflexible. Previous synchronous training algorithms for federated averaging (McMahan et al., 2016; Bonawitz et al., 2019) can only handle hundreds of devices in parallel, while there are nearly 4 billion mobile phones in total (eMarketer, 2019). Even at smaller scales, like a stadium during a game, there are thousands of devices involved. Too many devices checking in at the same time can congest the network on the server side. Thus, in each global epoch, the server is limited to selecting only from the subset of available devices to trigger the training tasks. Furthermore, since the task scheduling varies from device to device due to limited computational capacity and battery time, it is difficult to synchronize the selected devices at the end of each epoch. Some devices will no longer be available before synchronization. Instead, the server has to determine a timeout threshold to drop the stragglers. If the number of surviving devices is too small, the server may have to drop the entire epoch, including all the received updates.
To address these issues that arise in synchronous federated optimization, we propose a novel asynchronous federated optimization algorithm. The key idea is to use a weighted average to update the global model. The mixing weight can also be set adaptively as a function of the staleness. We show that taken together, these changes result in an effective asynchronous federated optimization procedure.
The main contributions of our paper are listed as follows:

We propose a new asynchronous federated optimization algorithm with provable convergence under non-IID settings.

We show that the proposed approach has near-linear convergence to a global optimum, for both strongly and non-strongly convex problems, as well as a restricted family of nonconvex problems.

We propose strategies for controlling the error caused by asynchrony. We introduce a mixing hyperparameter which adaptively controls the tradeoff between the convergence rate and variance reduction according to the staleness.

We show empirically that the proposed algorithm converges fast and outperforms synchronous federated optimization.
2 Related Work
Edge computing (Garcia Lopez et al., 2015; Hong et al., 2013) is increasingly applied in various scenarios such as smart homes, wearable devices, and sensor networks. Meanwhile, machine-learning applications are also moving from cloud to edge (Cao et al., 2015; Mahdavinejad et al., 2018; Zeydan et al., 2016). Typically, edge devices have weaker computation and communication capacity compared to workstations and datacenters, due to weak hardware, limited battery time, and metered networks. As a result, simple machine-learning models such as MobileNet (Howard et al., 2017) have been proposed for learning tasks on weak devices.
Existing federated optimization methods (Konečný et al., 2015, 2016; McMahan et al., 2016; Bonawitz et al., 2019) focus on synchronous training. In each global epoch, training tasks are triggered on a subset of workers. However, perhaps due to bad networking conditions and occasional issues, some workers may fail. When this happens, the server has to wait until a sufficient number of workers respond. Otherwise, the server times out, drops the current epoch, and moves on to the next epoch. As far as we know, this paper is the first to discuss asynchronous training in federated learning with provable convergence.
Asynchronous training (Zinkevich et al., 2009; Lian et al., 2017; Zheng et al., 2017) is widely used in traditional distributed SGD. Typically, asynchronous SGD converges faster than synchronous SGD, especially when the communication latency is high and heterogeneous. However, classic asynchronous SGD directly sends gradients to the servers after each local update, which is not feasible for edge devices due to unreliable and slow communication. In this paper, we take advantage of asynchronous training, and combine it with federated optimization.
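| Notation/Term | Description |
|---|---|
| $n$ | Number of devices |
| $T$ | Number of global epochs |
| $[n]$ | Set of integers $\{1, \ldots, n\}$ |
| $H_{min}$ | Minimal number of local iterations |
| $H^i_\tau$ | Number of local iterations in the $\tau$th epoch on the $i$th device |
| $x_t$ | Global model in the $t$th epoch on server |
| $x^i_{\tau,h}$ | Model initialized from $x_\tau$, updated in the $h$th local iteration, on the $i$th device |
| $\mathcal{D}^i$ | Dataset on the $i$th device |
| $z^i_{\tau,h}$ | Data (minibatch) sampled from $\mathcal{D}^i$ |
| $\gamma$ | Learning rate |
| $\alpha$ | Mixing hyperparameter |
| $\rho$ | Regularization weight |
| $t - \tau$ | Staleness |
| $s(t - \tau)$ | Function of staleness for adaptive $\alpha$ |
| $\lVert \cdot \rVert$ | All the norms in this paper are $\ell_2$-norms |
| Device | Where the training data are placed |
| Worker | One worker on each device, process that trains the model |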
3 Problem Formulation
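We consider federated learning with $n$ devices. On each device, there is a worker process that trains the model on local data. The overall goal is to train a global model $x \in \mathbb{R}^d$ using data from all the devices.
To do so, we consider the following optimization problem:
$$\min_{x \in \mathbb{R}^d} F(x),$$
where $F(x) = \frac{1}{n} \sum_{i \in [n]} \mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i)$, and for $\forall i \in [n]$, $z^i$ is sampled from the local data $\mathcal{D}^i$ on the $i$th device.
Note that different devices have different local datasets, i.e., $\mathcal{D}^i \neq \mathcal{D}^j$ for $\forall i \neq j$. Thus, samples drawn from different devices may have different expectations, i.e., in general, $\mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i) \neq \mathbb{E}_{z^j \sim \mathcal{D}^j} f(x; z^j)$ for $\forall i \neq j$.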
4 Methodology
A single execution of federated optimization has $T$ global epochs. In the $t$th epoch, the server receives a locally trained model $x_{new}$ from an arbitrary worker, and updates the global model by weighted averaging:
$$x_t = (1 - \alpha) x_{t-1} + \alpha x_{new},$$
where $\alpha \in (0, 1)$ is the mixing hyperparameter.
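The weighted-averaging step can be sketched as follows. This is only an illustration that assumes models are flat NumPy vectors and that the update is a convex combination of the previous global model and the received local model; the function name `mix_models` and the argument name `alpha` are hypothetical and not taken from the authors' code.

```python
import numpy as np


def mix_models(x_global, x_new, alpha):
    """Return (1 - alpha) * x_global + alpha * x_new for alpha in (0, 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    x_global = np.asarray(x_global, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    return (1.0 - alpha) * x_global + alpha * x_new


x_global = np.array([1.0, 1.0])   # previous global model
x_new = np.array([3.0, -1.0])     # locally trained model received from a worker
x_next = mix_models(x_global, x_new, alpha=0.25)   # -> array([1.5, 0.5])
```

A small `alpha` keeps the global model close to its previous value, while an `alpha` near 1 lets a single worker's model dominate.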
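On an arbitrary device $i$, after receiving a global model $x_t$ (potentially stale) from the server, we locally solve the following regularized optimization problem using SGD for an arbitrary number of iterations:
$$\min_{x \in \mathbb{R}^d} \mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i) + \frac{\rho}{2} \lVert x - x_t \rVert^2.$$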
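A minimal sketch of such a local solver is given below. It assumes a least-squares loss as a stand-in for the device loss and a proximal regularizer of the form $\frac{\rho}{2}\lVert x - x_{global}\rVert^2$; the function `local_sgd`, its default values, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def local_sgd(x_global, features, targets, lr=0.05, rho=0.1,
              num_iters=200, batch_size=8, seed=0):
    """Run minibatch SGD on a least-squares loss plus (rho / 2) * ||x - x_global||^2."""
    rng = np.random.default_rng(seed)
    x = x_global.copy()
    for _ in range(num_iters):
        idx = rng.choice(len(targets), size=batch_size, replace=False)
        a, b = features[idx], targets[idx]
        grad_loss = a.T @ (a @ x - b) / batch_size   # gradient of 0.5 * mean squared residual
        grad_reg = rho * (x - x_global)              # gradient of the proximal regularizer
        x -= lr * (grad_loss + grad_reg)
    return x


# Synthetic local dataset (assumed for illustration only).
rng = np.random.default_rng(1)
features = rng.normal(size=(64, 3))
targets = features @ np.array([1.0, -2.0, 0.5])
x_global = np.zeros(3)
x_local = local_sgd(x_global, features, targets)
```

The regularizer pulls the local iterate back toward the received global model, which limits how far a device with unusual data can drift before its model is mixed in.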
The server and workers conduct updates asynchronously. The server immediately updates the global model whenever it receives a local model. The communication between the server and workers is non-blocking.
The detailed algorithm is shown in Algorithm 1. The model parameter $x^i_{\tau,h}$ is updated in the $h$th local iteration after receiving $x_\tau$, on the $i$th device. $z^i_{\tau,h}$ is the data randomly drawn in the $h$th local iteration after receiving $x_\tau$, on the $i$th device. $H^i_\tau$ is the number of local iterations after receiving $x_\tau$, on the $i$th device. $\gamma$ is the learning rate and $T$ is the total number of global epochs.
Remark 1.
On the server side, there are two threads running asynchronously in parallel: scheduler and updater. The scheduler periodically triggers training tasks on some workers. The updater receives locally trained models from workers and updates the global model. There could be multiple updater threads with a read-write lock on the global model, which improves the throughput. The scheduler randomizes the timing of training tasks to avoid overloading the updater thread, and controls the staleness ($t - \tau$ in the updater thread). We illustrate the overview of the system in
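The scheduler/updater split can be sketched with Python threads as follows. This is an illustrative toy, not the authors' system: the class `Server`, the function `worker`, and the "move halfway to a target" stand-in for local training are hypothetical, and a plain mutex is used in place of a read-write lock because the Python standard library does not provide one.

```python
import queue
import threading

import numpy as np


class Server:
    """Holds the global model; a plain mutex stands in for the read-write lock."""

    def __init__(self, x0, alpha):
        self.x = np.asarray(x0, dtype=float)
        self.alpha = alpha
        self.t = 0                      # number of global updates applied so far
        self.lock = threading.Lock()
        self.inbox = queue.Queue()      # (local model, timestamp) pairs pushed by workers
        self.staleness_log = []

    def snapshot(self):
        """Scheduler side: hand a copy of the current model and its timestamp to a worker."""
        with self.lock:
            return self.x.copy(), self.t

    def updater(self, num_updates):
        """Updater thread: mix each received model into the global model."""
        for _ in range(num_updates):
            x_new, tau = self.inbox.get()          # blocks until some worker pushes
            with self.lock:
                self.staleness_log.append(self.t - tau)
                self.x = (1.0 - self.alpha) * self.x + self.alpha * x_new
                self.t += 1


def worker(server, target):
    """Stand-in for on-device training: move halfway from the received model to `target`."""
    x, tau = server.snapshot()
    server.inbox.put((x + 0.5 * (target - x), tau))


server = Server(x0=[0.0, 0.0], alpha=0.5)
target = np.array([1.0, 1.0])
updater_thread = threading.Thread(target=server.updater, args=(4,))
updater_thread.start()
worker_threads = [threading.Thread(target=worker, args=(server, target)) for _ in range(4)]
for th in worker_threads:
    th.start()
for th in worker_threads:
    th.join()
updater_thread.join()
```

Workers never wait for each other: each one pushes its result to the queue and exits, and the recorded staleness is simply the number of global updates applied between its snapshot and its push.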
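Intuitively, larger staleness results in greater error when updating the global model. For the local models with large staleness $(t - \tau)$, we decrease $\alpha$ to mitigate the error caused by staleness. As shown in Algorithm 1, optionally, we use a function $s(t - \tau)$ to decide the value of $\alpha$. We list some choices for $s(t - \tau)$, parameterized by $a > 0$, $b \geq 0$:

Linear: $s_a(t - \tau) = \frac{1}{a(t - \tau) + 1}$.

Polynomial: $s_a(t - \tau) = (t - \tau + 1)^{-a}$.

Exponential: $s_a(t - \tau) = \exp(-a(t - \tau))$.

Hinge: $s_{a,b}(t - \tau) = 1$ if $t - \tau \leq b$, and $\frac{1}{a(t - \tau - b) + 1}$ otherwise.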
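The four staleness functions can be sketched in code as below. The closed forms used here are assumptions about the intended definitions (each returns 1 at zero staleness and decreases as staleness grows), and the helper `adaptive_alpha`, which scales a base mixing weight by the chosen function, is a hypothetical name.

```python
import math


def s_linear(staleness, a):
    """Assumed linear form: 1 / (a * staleness + 1)."""
    return 1.0 / (a * staleness + 1.0)


def s_polynomial(staleness, a):
    """Assumed polynomial form: (staleness + 1) ** (-a)."""
    return (staleness + 1.0) ** (-a)


def s_exponential(staleness, a):
    """Assumed exponential form: exp(-a * staleness)."""
    return math.exp(-a * staleness)


def s_hinge(staleness, a, b):
    """Assumed hinge form: 1 up to staleness b, then 1 / (a * (staleness - b) + 1)."""
    if staleness <= b:
        return 1.0
    return 1.0 / (a * (staleness - b) + 1.0)


def adaptive_alpha(alpha, staleness, s, **params):
    """Scale the base mixing weight by the staleness function s."""
    return alpha * s(staleness, **params)


alpha_t = adaptive_alpha(0.6, staleness=3, s=s_polynomial, a=0.5)
```

The hinge variant leaves fresh updates untouched and only discounts those staler than `b`, whereas the other three start discounting as soon as the staleness is nonzero.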
5 Convergence Analysis
In this section, we prove the convergence of Algorithm 1 with non-IID data.
5.1 Assumptions
First, we introduce some definitions and assumptions for our convergence analysis.
Definition 1.
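(Smoothness) A differentiable function $f$ is $L$-smooth if for $\forall x, y$,
$$f(x) - f(y) \leq \langle \nabla f(y), x - y \rangle + \frac{L}{2} \lVert x - y \rVert^2,$$
where $L > 0$.
Definition 2.
(Strong convexity) A differentiable function $f$ is $\mu$-strongly convex if for $\forall x, y$,
$$\langle \nabla f(y), x - y \rangle + \frac{\mu}{2} \lVert x - y \rVert^2 \leq f(x) - f(y),$$
where $\mu \geq 0$. Note that if $\mu = 0$, $f$ is convex.
Definition 3.
(Weak convexity) A differentiable function $f$ is $\mu$-weakly convex if the function $g$ with $g(x) = f(x) + \frac{\mu}{2} \lVert x \rVert^2$ is convex, where $\mu \geq 0$.
Remark 2.
Note that when $f$ is $\mu$-weakly convex, then $f$ is convex if $\mu = 0$, and potentially nonconvex if $\mu > 0$.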
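As a one-dimensional illustration of weak convexity (an example of ours, not from the paper, using the usual definition that $f$ is $\mu$-weakly convex when $f(x) + \frac{\mu}{2} x^2$ is convex), the cosine function is nonconvex but 1-weakly convex:

```latex
f(x) = \cos(x), \qquad
g(x) = f(x) + \tfrac{1}{2} x^2, \qquad
g''(x) = 1 - \cos(x) \geq 0 \quad \text{for all } x \in \mathbb{R}.
```

Since $g'' \geq 0$ everywhere, $g$ is convex, so $f$ is weakly convex with $\mu = 1$ even though $f''(x) = -\cos(x)$ is negative near $x = 0$.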
Assumption 1.
(Existence of global optimum) We assume that there exists a set , where any element is a global optimum of , , and .
5.2 Convergence Guarantees
Based on the assumptions above, we have the following convergence guarantees. Detailed proofs can be found in the appendix.
Theorem 1.
Assume that the global loss function is smooth and strongly convex, and each worker executes at least local updates before pushing models to the server. Furthermore, we assume that for , and , we have , and . Taking , after global updates/epochs on the server, Algorithm 1 with Option I converges to a global optimum :
where .
Remark 3.
The mixing hyperparameter controls the tradeoff between the convergence rate and additional error caused by variance. When , the convergence rate approaches , with the additional error :
When , . As a result, the variance is reduced to . In practice, to balance the convergence rate and the variance reduction, we use diminishing : , such that the algorithm converges fast at the beginning, and reduces the variance at the end.
Theorem 2.
Assume that the global loss function is smooth and weakly convex (potentially nonconvex), and each worker executes at least local updates before pushing models to the server. Furthermore, we assume that for , and , we have , and , . Taking and , after global updates/epochs on the server, Algorithm 1 with Option II converges to a global optimum :
where .
6 Experiments
In this section, we empirically evaluate the proposed algorithm.
6.1 Datasets
We conduct experiments on the benchmark CIFAR-10 image classification dataset (Krizhevsky & Hinton, 2009), which is composed of 50k images for training and 10k images for testing. We resize each image and crop it to the shape of . We use a convolutional neural network (CNN) with 4 convolutional layers followed by 1 fully connected layer. We chose a simple network architecture so that it can be easily handled by mobile devices. The detailed network architecture can be found in the appendix. In each experiment, the training set is partitioned onto devices. Each of the partitions has images. For any worker, the minibatch size for SGD is .
6.2 Evaluation Specifics
The baseline algorithm is FedAvg introduced by McMahan et al. (2016), which is synchronous federated optimization. The detailed FedAvg is shown in Algorithm 2. For FedAvg, in each epoch, devices are randomly selected to launch local updates. We also use single-thread SGD as a baseline. The detailed SGD is shown in Algorithm 3. For the two baseline algorithms, we use grid search to tune the learning rates and report the best results according to the top-1 accuracy on the testing set.
We repeat each experiment 10 times and take the average. We use top-1 accuracy on the testing set and cross-entropy loss on the training set as the evaluation metrics.
For convenience, we name Algorithm 1 as FedAsync. We also test the performance of FedAsync with adaptive mixing hyperparameters , as mentioned in Section 4. We employ the following two strategies:

Polynomial: .

Hinge:
For convenience, we refer to FedAsync with polynomial adaptive $\alpha$ as FedAsync+Poly, and FedAsync with hinge adaptive $\alpha$ as FedAsync+Hinge.
To compare asynchronous training and synchronous training, we conduct three comparisons: metrics vs. number of global epochs, metrics vs. number of gradients, and metrics vs. number of communications:

The number of gradients is the number of gradients applied to the global model. Note that for both Algorithm 1 and Algorithm 2, an epoch of local iterations is a full pass of the local dataset. Thus, for FedAsync, in each global epoch, gradients are applied to the global model. For FedAvg, since , gradients are applied to the global model in each global epoch.

The number of communications measures the communication overhead on the server side. We count how many times the models are exchanged (sent/received) on the server. On average, in each global epoch, FedAvg has the communications of FedAsync. Single-thread SGD has no communication, so we ignore it.
In all the experiments, we simulate the asynchrony by randomly sampling the staleness from a uniform distribution.
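One way such a simulation could be organized is sketched below. It is an assumption-laden toy rather than the experimental code: the function `simulate_fedasync`, the uniform integer range for the staleness, and the "move halfway to a target" stand-in for local training are all hypothetical. Staleness is counted as the number of global updates between the model a worker started from and the current global model.

```python
import numpy as np


def simulate_fedasync(x0, local_train, num_epochs, alpha, max_staleness, seed=0):
    """Simulate asynchronous updates by starting each local run from a stale global model."""
    rng = np.random.default_rng(seed)
    history = [np.asarray(x0, dtype=float)]   # history[t] is the global model after t epochs
    for t in range(1, num_epochs + 1):
        # Staleness drawn uniformly from {0, ..., max_staleness}.
        staleness = int(rng.integers(0, max_staleness + 1))
        # Index of the (possibly stale) model the worker trained from, clipped at epoch 0.
        tau = max(0, t - 1 - staleness)
        x_new = local_train(history[tau])
        history.append((1.0 - alpha) * history[-1] + alpha * x_new)
    return history


target = np.array([2.0, -1.0])
history = simulate_fedasync(
    x0=np.zeros(2),
    local_train=lambda x: x + 0.5 * (target - x),   # stand-in for local SGD
    num_epochs=50,
    alpha=0.5,
    max_staleness=4,
)
```

Keeping the full history of global models makes it cheap to replay any staleness pattern without running real concurrent workers.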
6.3 Empirical Results
We test the algorithms with different initial learning rates , regularization weights , mixing hyperparameter , and staleness. decays by at the th global epoch.
In Figures 2 and 3, we show how FedAsync converges when the number of gradients grows. We can see that when the overall staleness is small, FedAsync converges as fast as SGD, and faster than FedAvg. When the staleness is larger, FedAsync converges slower. In the worst case, FedAsync has a convergence rate similar to that of FedAvg. When the mixing hyperparameter $\alpha$ is too large, the convergence can be unstable. Using adaptive $\alpha$, the convergence can be robust to large $\alpha$. Note that when the maximum staleness is , FedAsync and FedAsync+Hinge with are the same.
In Figures 4 and 5, we show how FedAsync converges when the number of global epochs grows. As expected, FedAvg makes more progress in each epoch. However, in each epoch, FedAvg has to wait until all the workers respond, while FedAsync only needs one worker’s response to move on to the next epoch.
In Figures 6 and 7, we show how FedAsync converges when the communication overhead grows. With the same amount of communication overhead, FedAsync converges faster than FedAvg when staleness is small. When staleness is large, FedAsync performs similarly to FedAvg.
In Figure 8, we show how staleness affects the convergence of FedAsync. Overall, larger staleness makes the convergence slower, but the influence is not catastrophic. Furthermore, using adaptive mixing hyperparameters, the instability caused by large staleness can be mitigated.
In Figures 9 and 10, we show how the mixing hyperparameter $\alpha$ affects the convergence of FedAsync. In general, FedAsync is robust to different $\alpha$. Note that the difference is so small that we have to zoom in. When the staleness is small, an adaptive mixing hyperparameter is less necessary. When the staleness is large, smaller $\alpha$ is better for FedAsync, while larger $\alpha$ is better for FedAsync+Poly and FedAsync+Hinge. That is because adaptive $\alpha$ is automatically adjusted to be smaller when the staleness is large, so we should not manually decrease $\alpha$.
6.4 Discussion
In general, the convergence rate of FedAsync is between single-thread SGD and FedAvg. A larger mixing hyperparameter $\alpha$ and smaller staleness make FedAsync closer to single-thread SGD. Smaller $\alpha$ and larger staleness make FedAsync closer to FedAvg.
FedAsync is generally insensitive to hyperparameters. When the staleness is large, we can tune $\alpha$ to improve the convergence. Without adaptive $\alpha$, smaller $\alpha$ is better for larger staleness. For adaptive $\alpha$, the best choice is FedAsync+Poly with .
Larger staleness makes the convergence slower and unstable. There are three ways to control the influence of staleness:

On the server side, the updater thread can drop the updates with large staleness. This can also be viewed as a special case of the adaptive mixing hyperparameter. In particular, when the staleness is too large, we can simply take $\alpha = 0$.

More generally, using adaptive mixing hyperparameters improves the convergence, as shown in Figure 8. Different strategies yield different improvements. So far we find that FedAsync+Poly with has the best performance.

On the server side, the scheduler thread can control the assignment of training tasks to the workers. If the on-device training is triggered less frequently, the overall staleness will be smaller.
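The first option, dropping overly stale updates, can be sketched as a mixing step whose weight is set to zero past a threshold. The function `apply_update` and the parameter `max_staleness` are hypothetical names, and the threshold value is chosen only for illustration.

```python
import numpy as np


def apply_update(x_global, x_new, alpha, staleness, max_staleness):
    """Mix x_new into x_global, using weight 0 (dropping the update) when it is too stale."""
    effective_alpha = 0.0 if staleness > max_staleness else alpha
    return (1.0 - effective_alpha) * x_global + effective_alpha * x_new


x_global = np.array([1.0, 1.0])
x_new = np.array([3.0, 3.0])
kept = apply_update(x_global, x_new, alpha=0.5, staleness=2, max_staleness=4)      # -> [2., 2.]
dropped = apply_update(x_global, x_new, alpha=0.5, staleness=9, max_staleness=4)   # -> [1., 1.]
```

With a zero weight the global model is returned unchanged, which is exactly what discarding the update would do.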
Systematically, FedAsync has the following advantages compared to FedAvg:

Efficiency: The server can receive the updates from the workers at any time. Unlike FedAvg, stragglers’ updates will not be dropped. When the staleness is small, FedAsync converges much faster than FedAvg. In the worst case, when the staleness is large, FedAsync still performs similarly to FedAvg.

Flexibility: If some workers are no longer eligible for the training tasks (the devices are no longer idle, charging, or connected to unmetered networks), they can temporarily save the workspace, and continue the training or push the trained model to the server later. This also gives more flexibility to the scheduler on the server. Unlike FedAvg, FedAsync can schedule training tasks even if the workers are currently ineligible, since the server does not wait until the workers respond. The currently ineligible workers can start the training tasks later.

Scalability: Compared to FedAvg, FedAsync can handle more workers running in parallel since all the updates on the server and the workers are non-blocking. The server only needs to randomize the responding time of the workers to avoid congesting the network.
7 Conclusion
We proposed a novel asynchronous federated optimization algorithm on non-IID training data. The algorithm has near-linear convergence to a global optimum, for both strongly and non-strongly convex problems, as well as a restricted family of nonconvex problems. For future work, we plan to investigate the design of strategies to adaptively tune the mixing hyperparameters.
References
 Bonawitz et al. (2019) Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konecny, J., Mazzocchi, S., McMahan, H. B., et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
 Cao et al. (2015) Cao, Y., Hou, P., Brown, D., Wang, J., and Chen, S. Distributed analytics and edge intelligence: Pervasive health monitoring at the era of fog computing. In Proceedings of the 2015 Workshop on Mobile Big Data, pp. 43–48. ACM, 2015.

 eMarketer (2019) eMarketer. Number of mobile phone users worldwide from 2015 to 2020 (in billions). 2019. https://www.statista.com/statistics/274774/forecast-of-mobile-phone-users-worldwide/, Last visited: Mar. 2019.
 EU (2018) EU. European Union’s General Data Protection Regulation (GDPR). 2018. https://eugdpr.org/, Last visited: Nov. 2018.
 Garcia Lopez et al. (2015) Garcia Lopez, P., Montresor, A., Epema, D., Datta, A., Higashino, T., Iamnitchi, A., Barcellos, M., Felber, P., and Riviere, E. Edge-centric computing: Vision and challenges. ACM SIGCOMM Computer Communication Review, 45(5):37–42, 2015.
 HealthInsurance.org (1996) HealthInsurance.org, S. A. Health insurance portability and accountability act of 1996. Public law, 104:191, 1996.
 Ho et al. (2013) Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J. K., Gibbons, P. B., Gibson, G. A., Ganger, G., and Xing, E. P. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems, pp. 1223–1231, 2013.
 Hong et al. (2013) Hong, K., Lillethun, D., Ramachandran, U., Ottenwälder, B., and Koldehofe, B. Mobile fog: A programming model for large-scale applications on the internet of things. In Proceedings of the second ACM SIGCOMM workshop on Mobile cloud computing, pp. 15–20. ACM, 2013.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Konečný et al. (2015) Konečný, J., McMahan, B., and Ramage, D. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.
 Konečný et al. (2016) Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Li et al. (2014a) Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pp. 583–598, 2014a.
 Li et al. (2014b) Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pp. 19–27, 2014b.
 Lian et al. (2017) Lian, X., Zhang, W., Zhang, C., and Liu, J. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017.
 Mahdavinejad et al. (2018) Mahdavinejad, M. S., Rezvan, M., Barekatain, M., Adibi, P., Barnaghi, P., and Sheth, A. P. Machine learning for internet of things data analysis: A survey. Digital Communications and Networks, 4(3):161–175, 2018.
 McMahan et al. (2016) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
 Zeydan et al. (2016) Zeydan, E., Bastug, E., Bennis, M., Kader, M. A., Karatepe, I. A., Er, A. S., and Debbah, M. Big data caching for networking: Moving from cloud to edge. IEEE Communications Magazine, 54(9):36–42, 2016.
 Zheng et al. (2017) Zheng, S., Meng, Q., Wang, T., Chen, W., Yu, N., Ma, Z.-M., and Liu, T.-Y. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 4120–4129. JMLR.org, 2017.
 Zinkevich et al. (2009) Zinkevich, M., Langford, J., and Smola, A. J. Slow learners are fast. In Advances in neural information processing systems, pp. 2331–2339, 2009.
Appendix
8 Proofs
Theorem 1.
Assume that the global loss function is smooth and strongly convex, and each worker executes at least local updates before pushing models to the server. Furthermore, we assume that for , and , we have , and . Taking , after global updates on the server, Algorithm 1 with Option I converges to a global optimum :
where .
Proof.
Without loss of generality, we assume that in the epoch, the server receives the model , with time stamp . We assume that is the result of applying local updates to on the th device. We also ignore in and for convenience.
Thus, using smoothness and strong convexity, conditional on , for we have
By telescoping and taking total expectation, after local updates, we have
,  
On the server side, we have . Thus, conditional on all , we have
convexity  
By telescoping and taking total expectation, after global updates on the server, we have
∎
Theorem 2.
Assume that the global loss function is smooth and weakly convex (potentially nonconvex), and each worker executes at least local updates before pushing models to the server. Furthermore, we assume that for , and , we have , and , . Taking and , after global updates on the server, Algorithm 1 with Option II converges to a global optimum :
where .
Proof.
Without loss of generality, we assume that in the epoch, the server receives the model , with time stamp . We assume that is the result of applying local updates to on the th device. We also ignore in and for convenience.
Thus, using smoothness and strong convexity, conditional on , for we have
is strongly convex  
Note that for , we have strongly convex function . Thus, we have
Thus, we have
By telescoping and taking total expectation, after local updates, we have
On the server side, we have . Thus, conditional on all , we have
convexity of  