Federated Learning: Challenges, Methods, and Future Directions
Federated learning involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized. Training in heterogeneous and potentially massive networks introduces novel challenges that require a fundamental departure from standard approaches for large-scale machine learning, distributed optimization, and privacy-preserving data analysis. In this article, we discuss the unique characteristics and challenges of federated learning, provide a broad overview of current approaches, and outline several directions of future work that are relevant to a wide range of research communities.
Mobile phones, wearable devices, and autonomous vehicles are just a few of the modern distributed networks generating a wealth of data each day. Due to the growing computational power of these devices—coupled with concerns over transmitting private information—it is increasingly attractive to store data locally and push network computation to the edge.
The concept of edge computing is not a new one. Indeed, computing simple queries across distributed, low-powered devices is a decades-long area of research that has been explored under the purview of query processing in sensor networks, computing at the edge, and fog computing [73, 28, 11, 48, 39]. Recent works have also considered training machine learning models centrally but serving and storing them locally; for example, this is a common approach in mobile user modeling and personalization [59, 89].
However, as the storage and computational capabilities of the devices within distributed networks grow, it is possible to leverage enhanced local resources on each device. This has led to a growing interest in federated learning, which explores training statistical models directly on remote devices.
Federated learning methods have been deployed by major service providers [9, 124], and play a critical role in supporting privacy-sensitive applications where the training data are distributed at the edge [e.g., 138, 88, 50, 104, 45, 127, 4]. Examples of potential applications include: learning sentiment, semantic location, or activities of mobile phone users; adapting to pedestrian behavior in autonomous vehicles; and predicting health events like heart attack risk from wearable devices [5, 83, 51]. We discuss several canonical applications of federated learning below:
Smart phones. By jointly learning user behavior across a large pool of mobile phones, statistical models can power applications such as next-word prediction, face detection, and voice recognition [88, 45]. However, users may not be willing to share their data in order to protect their personal privacy or to save the limited bandwidth/battery power of their phone. Federated learning has the potential to enable predictive features on smart phones without diminishing the user experience or leaking private information. Figure 1 depicts one such application in which we aim to learn a next-word predictor in a large-scale mobile phone network based on users’ historical text data.
Organizations. Organizations or institutions can also be viewed as ‘devices’ in the context of federated learning. For example, hospitals are organizations that contain a multitude of patient data for predictive healthcare. However, hospitals operate under strict privacy practices, and may face legal, administrative, or ethical constraints that require data to remain local. Federated learning is a promising solution for these applications, as it can reduce strain on the network and enable private learning between various devices/organizations.
Internet of things. Modern IoT networks, such as wearable devices, autonomous vehicles, or smart homes, may contain numerous sensors that allow them to collect, react, and adapt to incoming data in real-time. For example, a fleet of autonomous vehicles may require an up-to-date model of traffic, construction, or pedestrian behavior to safely operate. However, building aggregate models in these scenarios may be difficult due to the private nature of the data and the limited connectivity of each device. Federated learning methods can help to train models that efficiently adapt to changes in these systems while maintaining user privacy [97, 83].
1.1 Problem Formulation
The canonical federated learning problem involves learning a single, global statistical model from data stored on tens to potentially millions of remote devices. We aim to learn this model under the constraint that device-generated data is stored and processed locally, with only intermediate updates being communicated periodically with a central server. In particular, the goal is typically to minimize the following objective function:
$$\min_{w} F(w), \quad \text{where} \quad F(w) := \sum_{k=1}^{m} p_k F_k(w). \tag{1}$$
Here, $m$ is the total number of devices, $p_k \geq 0$ and $\sum_k p_k = 1$, and $F_k$ is the local objective function for the $k$th device. The local objective function is often defined as the empirical risk over local data, i.e., $F_k(w) = \frac{1}{n_k} \sum_{j_k=1}^{n_k} f_{j_k}(w; x_{j_k}, y_{j_k})$, where $n_k$ is the number of samples available locally. The user-defined term $p_k$ specifies the relative impact of each device, with two natural settings being $p_k = \frac{1}{n}$ or $p_k = \frac{n_k}{n}$, where $n = \sum_k n_k$ is the total number of samples. We will reference problem (1) throughout the article, but, as discussed below, we note that other objectives or modeling approaches may be appropriate depending on the application of interest.
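As a minimal sketch of problem (1), the snippet below evaluates a weighted sum of per-device empirical risks. The squared-error loss, the scalar model, and the function names `local_objective` and `global_objective` are illustrative assumptions rather than part of the article; by default each device is weighted by its share of the total samples.

```python
def local_objective(w, samples):
    """Empirical risk on one device: mean squared error of the scalar model y ~ w * x.

    The squared-error loss is an assumption made for this illustration.
    """
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)


def global_objective(w, device_data, weights=None):
    """Weighted sum of local objectives, as in problem (1).

    device_data: one list of (x, y) samples per device.
    weights: optional per-device weights; defaults to each device's
             share of the total number of samples.
    """
    if weights is None:
        n = sum(len(samples) for samples in device_data)
        weights = [len(samples) / n for samples in device_data]
    return sum(p_k * local_objective(w, samples)
               for p_k, samples in zip(weights, device_data))
```

For example, with two devices holding `[(1, 2), (2, 4)]` and `[(1, 3)]`, the model `w = 2.0` fits the first device exactly and has a local loss of 1 on the second, so the sample-weighted objective is one third.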
1.2 Core Challenges
We next describe four of the core challenges associated with solving the distributed optimization problem posed in (1). These challenges make the federated setting distinct from other classical problems, such as distributed learning in data center settings or traditional private data analyses.
Challenge 1: Expensive Communication. Communication is a critical bottleneck in federated networks, which, coupled with privacy concerns over sending raw data, necessitates that data generated on each device remain local. Indeed, federated networks are potentially comprised of a massive number of devices, e.g., millions of smart phones, and communication in the network can be slower than local computation by many orders of magnitude [115, 49]. In order to fit a model to data generated by the devices in the federated network, it is therefore necessary to develop communication-efficient methods that iteratively send small messages or model updates as part of the training process, as opposed to sending the entire dataset over the network. To further reduce communication in such a setting, two key aspects to consider are: (i) reducing the total number of communication rounds, or (ii) reducing the size of transmitted messages at each round.
Challenge 2: Systems Heterogeneity. The storage, computational, and communication capabilities of each device in federated networks may differ due to variability in hardware (CPU, memory), network connectivity (3G, 4G, 5G, wifi), and power (battery level). Additionally, the network size and systems-related constraints on each device typically result in only a small fraction of the devices being active at once, e.g., hundreds of active devices in a million-device network. Each device may also be unreliable, and it is not uncommon for an active device to drop out at a given iteration due to connectivity or energy constraints. These system-level characteristics dramatically exacerbate challenges such as straggler mitigation and fault tolerance. Federated learning methods that are developed and analyzed must therefore: (i) anticipate a low amount of participation, (ii) tolerate heterogeneous hardware, and (iii) be robust to dropped devices in the network.
Challenge 3: Statistical Heterogeneity. Devices frequently generate and collect data in a non-identically distributed manner across the network, e.g., mobile phone users have varied use of language in the context of a next-word prediction task. Moreover, the number of data points across devices may vary significantly, and there may be an underlying structure present that captures the relationship amongst devices and their associated distributions. This data generation paradigm violates frequently-used independent and identically distributed (I.I.D.) assumptions in distributed optimization, increases the likelihood of stragglers, and may add complexity in terms of modeling, analysis, and evaluation. Indeed, although the canonical federated learning problem of (1) aims to learn a single global model, there exist other alternatives such as simultaneously learning distinct local models via multi-task learning frameworks [cf. 105]. There is also a close connection in this regard between leading approaches for federated learning and meta-learning. Both the multi-task and meta-learning perspectives enable personalized or device-specific modeling, which is often a more natural approach to handle the statistical heterogeneity of the data.
Challenge 4: Privacy Concerns. Finally, privacy is often a major concern in federated learning applications. Federated learning makes a step towards protecting data generated on each device by sharing model updates, e.g., gradient information, instead of the raw data [29, 32, 16]. However, communicating model updates throughout the training process can nonetheless reveal sensitive information, either to a third party or to the central server. While recent methods aim to enhance the privacy of federated learning using tools such as secure multiparty computation or differential privacy, these approaches often provide privacy at the cost of reduced model performance or system efficiency. Understanding and balancing these trade-offs, both theoretically and empirically, is a considerable challenge in realizing private federated learning systems.
2 Survey of Related and Current Work
The challenges in federated learning at first glance resemble classical problems in areas such as privacy, large-scale machine learning, and distributed optimization. For instance, numerous methods have been proposed to tackle expensive communication in the machine learning, optimization, and signal processing communities. However, these methods are typically unable to fully handle the scale of federated networks, much less the challenges of systems and statistical heterogeneity. Similarly, while privacy is an important aspect for many machine learning applications, privacy-preserving methods for federated learning can be challenging to rigorously assert due to the statistical variation in the data, and may be even more difficult to implement due to systems constraints on each device and across the potentially massive network. In this section, we explore in more detail the challenges presented in Section 1, including a discussion of classical results as well as more recent work focused specifically on federated learning.
2.1 Communication-Efficiency
Communication is a key bottleneck to consider when developing methods for federated networks. While it is beyond the scope of this article to provide a self-contained review of communication-efficient distributed learning methods, we point out several general directions, which we group into (1) local updating methods, (2) compression schemes, and (3) decentralized training.
Mini-batch optimization methods, which involve extending classical stochastic methods to process multiple data points at a time, have emerged as a popular paradigm for distributed machine learning in data center environments [27, 101, 102, 87, 95]. In practice, however, they have been shown to have limited flexibility to adapt to communication-computation trade-offs that would maximally leverage distributed data processing [106, 107]. In response, several recent methods have been proposed to improve communication-efficiency in distributed settings by allowing for a variable number of local updates to be applied on each machine in parallel at each communication round, making the amount of computation versus communication substantially more flexible. For convex objectives, distributed local-updating primal-dual methods have emerged as a popular way to tackle such a problem [106, 61, 71, 53, 128]. These approaches leverage duality structure to effectively decompose the global objective into subproblems that can be solved in parallel at each communication round. Several distributed local-updating primal methods have also been proposed, which have the added benefit of being applicable to non-convex objectives [136, 92]. These methods drastically improve performance in practice, and have been shown to achieve orders-of-magnitude speedups over traditional mini-batch methods or distributed approaches like ADMM in real-world data center environments. We provide an intuitive illustration of local updating methods in Figure 2.
In federated settings, optimization methods that allow for flexible local updating and low client participation have become the de facto solvers [74, 105, 64]. The most commonly used method for federated learning is Federated Averaging (FedAvg), a method based on averaging local stochastic gradient descent (SGD) updates for the primal problem. FedAvg has been shown to work well empirically, particularly for non-convex problems, but comes without convergence guarantees and can diverge in practical settings when data are heterogeneous. We discuss methods to handle such statistical heterogeneity in more detail in Section 2.3.2.
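The local-updating idea behind FedAvg can be sketched in a few lines. The code below is a simplified illustration, not the cited authors' implementation: it assumes a linear model with squared-error loss, samples a fixed number of devices per round, runs a few epochs of local SGD on each, and averages the resulting models weighted by local sample counts. The function names and all hyperparameter values are hypothetical.

```python
import random


def local_sgd(w, samples, lr, epochs, rng):
    """Run a few epochs of SGD on one device, starting from the global model w."""
    w = list(w)
    for _ in range(epochs):
        order = list(range(len(samples)))
        rng.shuffle(order)
        for i in order:
            x, y = samples[i]
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # Gradient of 0.5 * err**2 with respect to w is err * x.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def fedavg(device_data, dim, rounds=50, clients_per_round=2,
           lr=0.05, epochs=2, seed=0):
    """Simplified FedAvg loop: sample devices, update locally, average."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(rounds):
        chosen = rng.sample(range(len(device_data)), clients_per_round)
        # Each chosen device returns its sample count and locally updated model.
        updates = [(len(device_data[k]),
                    local_sgd(w, device_data[k], lr, epochs, rng))
                   for k in chosen]
        total = sum(n_k for n_k, _ in updates)
        # Server step: average the local models, weighted by sample count.
        w = [sum(n_k * w_k[j] for n_k, w_k in updates) / total
             for j in range(dim)]
    return w
```

Raising `epochs` trades more local computation for fewer communication rounds, which is the flexibility that local-updating methods provide. On heterogeneous data, however, many local epochs can pull the local models apart, in line with the divergence behavior noted above.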
While local updating methods can reduce the total number of communication rounds, model compression schemes such as sparsification, subsampling, and quantization can significantly reduce the size of messages communicated at each round. These methods have been extensively studied, both empirically and theoretically, in previous literature for distributed training in data center environments; we refer readers to [119, 135] for a more complete review. In federated environments, the low participation of devices, non-identically distributed local data, and local updating schemes pose novel challenges to these model compression approaches. For instance, the commonly-used error compensation techniques in classical distributed learning cannot be directly extended to federated settings as the errors accumulated locally may be stale if the devices are not frequently sampled. Nevertheless, several works have provided practical strategies in federated settings, such as forcing the updating models to be sparse and low-rank; performing quantization with structured random rotations; using lossy compression and dropout to reduce server-to-device communication; and applying Golomb lossless encoding. From a theoretical perspective, while prior work has explored convergence guarantees with low-precision training in the presence of non-identically distributed data [e.g., 110], the assumptions made do not take into consideration common characteristics of the federated setting, such as low device participation or locally-updating optimization methods.
In federated learning, a star network (where a central server is connected to a network of devices, as in the left panel of Figure 3) is the predominant communication topology; we therefore focus on the star-network setting in this article. However, we briefly discuss decentralized topologies (where devices only communicate with their neighbors, e.g., the right panel of Figure 3) as a potential alternative. In data center environments, decentralized training has been demonstrated to be faster than centralized training when operating on networks with low bandwidth or high latency; we refer readers to [46, 66] for a more comprehensive review. Similarly, in federated learning, decentralized algorithms can in theory reduce the high communication cost on the central server. Some recent works [46, 60] have investigated decentralized training over heterogeneous data with local updating schemes. However, they are either restricted to linear models or assume full device participation. Finally, hierarchical communication patterns have also been proposed [69, 67] to further ease the burden on the central server, by first leveraging edge servers to aggregate the updates from edge devices and then relying on a cloud server to aggregate updates from edge servers. While this is a promising approach to reduce communication, it is not applicable to all networks, as this type of physical hierarchy may not exist or be known a priori.
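To make two of these compression ideas concrete, the sketch below shows top-k sparsification and uniform quantization of a model update. It is a generic illustration under simple assumptions (a flat list of floats, a shared value range sent alongside the codes) and does not reproduce any of the specific schemes cited above; all function names are hypothetical.

```python
def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries; return (index, value) pairs to send."""
    ranked = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    return sorted((i, update[i]) for i in ranked[:k])


def densify(pairs, dim):
    """Rebuild a dense update on the receiver, filling dropped entries with zero."""
    dense = [0.0] * dim
    for i, value in pairs:
        dense[i] = value
    return dense


def quantize(update, num_bits=8):
    """Uniformly quantize values to integer codes in [0, 2**num_bits - 1]."""
    lo, hi = min(update), max(update)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in update]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]
```

Sparsification sends only `k` index-value pairs instead of the full vector, while quantization sends one small integer per entry plus two floats. In both cases the receiver reconstructs an approximation, so the saved bandwidth is paid for with some loss of precision.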
2.2 Systems Heterogeneity
In federated settings, there is significant variability in the systems characteristics across the network, as devices may differ in terms of hardware, network connectivity, and battery power. As depicted in Figure 4, these systems characteristics make issues such as stragglers significantly more prevalent than in typical data center environments. We roughly group several key directions to handle systems heterogeneity into: (i) asynchronous communication, (ii) active device sampling, and (iii) fault tolerance. As mentioned in Section 2.1.3, we assume a star topology in our following discussions.
In traditional data center settings, synchronous and asynchronous schemes are both commonly used to parallelize iterative optimization algorithms, with each approach having pros and cons. Synchronous schemes are simple and guarantee a serial-equivalent computational model, but they are also more susceptible to stragglers in the face of device variability. Asynchronous schemes are an attractive approach to mitigate stragglers in heterogeneous environments, particularly in shared-memory systems [91, 30, 141, 47, 26]. However, they typically rely on bounded-delay assumptions to control the degree of staleness, which for device $k$ depends on the number of other devices that have updated since device $k$ pulled from the central server. While asynchronous parameter servers have been successful in distributed data centers [e.g., 141, 47, 26], classical bounded-delay assumptions can be unrealistic in federated settings, where the delay may be on the order of hours to days, or completely unbounded.
In federated networks, typically only a small subset of devices participate at each round of training. However, the vast majority of federated methods, e.g., those described in [74, 105, 64, 9, 46], are passive in that they do not aim to influence which devices participate. An alternative approach involves actively selecting participating devices at each round. For example, Nishio and Yonetani explore novel device sampling policies based on systems resources, with the aim being for the server to aggregate as many device updates as possible within a pre-defined time window. Similarly, Kang et al. take into account systems overheads incurred on each device when designing incentive mechanisms to encourage devices with higher-quality data to participate in the learning process. However, these methods assume a static model of the systems characteristics of the network; it remains open how to extend these approaches to handle real-time, device-specific fluctuations in computation and communication delays. Moreover, while these methods primarily focus on systems variability to perform active sampling, we note that it is also worth considering actively sampling a small but sufficiently representative set of devices based on the underlying statistical structure.
Fault tolerance has been extensively studied in the systems community and is a fundamental consideration of classical distributed systems [109, 70, 18]. Recent works have also investigated fault tolerance specifically for machine learning workloads in data center environments [e.g., 111, 86]. When learning over remote devices, however, fault tolerance becomes more critical as it is common for some participating devices to drop out at some point before the completion of the given training iteration. One practical strategy is to simply ignore such device failures, which may introduce bias into the device sampling scheme if the failed devices have specific data characteristics. For instance, devices from remote areas may be more likely to drop due to poor network connections and thus the trained federated model will be biased towards devices with favorable network conditions. Theoretically, while several recent works have investigated convergence guarantees of variants of federated learning methods [131, 122, 55, 132], few analyses allow for low participation [e.g., 64, 105], or study directly the effect of dropped devices.
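A deadline-based selection policy of the kind described above can be sketched as follows. This is a deliberately simplified stand-in, not the policy from the cited works: it assumes the server already holds a static per-device estimate of total round time (computation plus upload), that devices run in parallel, and that at most `max_devices` are wanted. The function name and the timing values in the usage note are hypothetical.

```python
def select_devices(estimated_seconds, deadline, max_devices):
    """Pick up to max_devices devices whose estimated round time fits the deadline.

    estimated_seconds: dict mapping device id to estimated seconds for one round.
    Faster devices are preferred, so that as many updates as possible arrive in time.
    """
    eligible = [(t, device) for device, t in estimated_seconds.items()
                if t <= deadline]
    eligible.sort()
    return [device for _, device in eligible[:max_devices]]
```

With estimates `{"a": 12.0, "b": 45.0, "c": 8.0, "d": 30.0}` and a 30-second window, device `b` is never selected. A policy like this consistently favors well-connected devices, which is one way the sampling bias discussed in this section can arise.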
Coded computation is another option to tolerate device failures by introducing algorithmic redundancy. Recent works have explored using codes to speed up distributed machine learning training [e.g., 62, 108, 20, 93, 19]. For instance, in the presence of stragglers, gradient coding and its variants [108, 20, 19] carefully replicate data blocks (as well as the gradient computation on those data blocks) across computing nodes to obtain either exact or inexact recovery of the true gradients. While this is a seemingly promising approach for the federated setting, these methods face fundamental challenges in federated networks as sharing data/replication across devices is often infeasible due to privacy constraints and the scale of the network.
2.3 Statistical Heterogeneity
Challenges arise when training federated models from data that is not identically distributed across devices, both in terms of modeling the data (as depicted in Figure 5), and in terms of analyzing the convergence behavior of associated training procedures. We discuss related work in these directions below.
Modeling Heterogeneous Data
There exists a large body of literature in machine learning that has modeled statistical heterogeneity via methods such as meta-learning and multi-task learning [17, 36]; these ideas have been recently extended to the federated setting [23, 105, 25, 57, 34, 139]. For instance, MOCHA, an optimization framework designed for the federated setting, can allow for personalization by learning separate but related models for each device while leveraging a shared representation via multi-task learning. This method has provable theoretical convergence guarantees for the considered objectives, but is limited in its ability to scale to massive networks and is restricted to convex objectives. Another approach models the star topology as a Bayesian network and performs variational inference during learning. Although this method can handle non-convex models, it is expensive to generalize to large federated networks. Khodak et al. provably meta-learn a within-task learning rate using multi-task information (where each task corresponds to a device) and have demonstrated improved empirical performance over vanilla FedAvg. Eichner et al. investigate a pluralistic solution (adaptively choosing between a global model and device-specific models) to address the cyclic patterns in data samples during federated training. Zhao et al. explore transfer learning for personalization by running FedAvg after training a global model centrally on some shared proxy data. Despite these recent advances, key challenges still remain in making methods for heterogeneous modeling that are robust, scalable, and automated in federated settings.
When modeling federated data, it may also be important to consider issues beyond accuracy, such as fairness. In particular, naively solving an aggregate loss function such as in (1) may implicitly advantage or disadvantage some of the devices, as the learned model may become biased towards devices with larger amounts of data, or (if weighting devices equally), to commonly occurring groups of devices. Recent works have proposed modified modeling approaches that aim to reduce the variance of the model performance across devices. Some heuristics simply perform a varied number of local updates based on local loss. Other more principled approaches include Agnostic Federated Learning, which optimizes the centralized model for any target distribution formed by a mixture of the client distributions via a minimax optimization scheme. Another more general approach is taken by Li et al., who propose an objective called $q$-FFL in which devices with higher loss are given higher relative weight to encourage less variance in the final accuracy distribution. Beyond issues of fairness, we note that aspects such as accountability and interpretability in federated learning are additionally worth exploring, but may be challenging due to the scale and heterogeneity of the network.
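The reweighting idea behind a loss-sensitive fairness objective can be illustrated numerically. The sketch below assumes the form in which each local loss is raised to the power q + 1 and scaled by its device weight divided by q + 1, which is our reading of the objective in the cited fairness work and should be checked against that paper. The function names are hypothetical, and local losses are assumed to be positive.

```python
def fair_objective(local_losses, weights, q):
    """Sum over devices of p_k / (q + 1) * F_k ** (q + 1).

    With q = 0 this reduces to the weighted objective in (1).
    """
    return sum(p / (q + 1) * loss ** (q + 1)
               for p, loss in zip(weights, local_losses))


def relative_emphasis(local_losses, weights, q):
    """Normalized per-device gradient weight, proportional to p_k * F_k ** q."""
    raw = [p * loss ** q for p, loss in zip(weights, local_losses)]
    total = sum(raw)
    return [r / total for r in raw]
```

For two equally weighted devices with losses 1.0 and 4.0, setting `q = 0` gives each device half of the emphasis, whereas `q = 1` shifts 80% of it to the device with the higher loss.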
Convergence Guarantees for Non-IID Data
Statistical heterogeneity also presents novel challenges in terms of analyzing the convergence behavior in federated settings, even when learning a single global model. Indeed, when data is not identically distributed across devices in the network, methods such as FedAvg have been shown to diverge in practice [64, 74]. Parallel SGD and related variants, which make local updates similar to FedAvg, have been analyzed in the I.I.D. setting [67, 92, 103, 107, 120, 125, 136, 140, 123, 121]. However, the results rely on the premise that each local solver is a copy of the same stochastic process (due to the I.I.D. assumption), which is not the case in typical federated settings. To understand the performance of FedAvg in statistically heterogeneous settings, FedProx has recently been proposed. FedProx makes a small modification to the FedAvg method to help ensure convergence, both theoretically and in practice. FedProx can also be interpreted as a generalized, reparameterized version of FedAvg that has practical ramifications in the context of accounting for systems heterogeneity across devices. Several other works [122, 131, 55, 132] have also explored convergence guarantees in the presence of heterogeneous data with different assumptions, e.g., convexity or uniformly bounded gradients. There are also heuristic approaches that aim to tackle statistical heterogeneity, either by sharing local device data or some server-side proxy data [51, 54, 139]. However, these methods may be unrealistic: in addition to imposing burdens on network bandwidth, sending local data to the server violates the key privacy assumption of federated learning, and sending globally-shared proxy data to all devices [51, 139] requires effort to carefully generate or collect such auxiliary data.
2.4 Privacy
Privacy concerns often motivate the need to keep raw data on each device local in federated settings. However, sharing other information such as model updates as part of the training process can also leak sensitive user information [7, 16, 77, 38]. For instance, Carlini et al. demonstrate that one can extract sensitive text patterns, e.g., a specific credit card number, from a recurrent neural network trained on users’ language data. Given increasing interest in privacy-preserving learning approaches, in Section 2.4.1, we first briefly revisit prior work on enhancing privacy in the general (distributed) machine learning setting. We then review recent privacy-preserving methods specifically designed for federated settings in Section 2.4.2.
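The modification that FedProx makes can be sketched as a change to the local update only. To our understanding of the cited work, each device minimizes its local loss plus a proximal term, (mu / 2) times the squared distance to the current global model, which discourages local models from drifting too far. The code below is an illustrative sketch under that assumption, using a linear model with squared-error loss; the function name and the values of `mu`, `lr`, and `steps` are hypothetical.

```python
def fedprox_local_update(w_global, samples, mu, lr, steps):
    """Local SGD on the loss plus (mu / 2) * ||w - w_global||**2.

    Setting mu = 0 recovers a plain local SGD update as used in FedAvg.
    """
    w = list(w_global)
    for step in range(steps):
        x, y = samples[step % len(samples)]
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        # Loss gradient is err * x; proximal gradient is mu * (w - w_global).
        w = [wi - lr * (err * xi + mu * (wi - gi))
             for wi, xi, gi in zip(w, x, w_global)]
    return w
```

With a single sample `((1.0,), 4.0)` and a global model at zero, `mu = 0` lets the local model move all the way to 4, while `mu = 1` settles at 2, halfway between the local optimum and the global model. Larger `mu` therefore limits how far heterogeneous devices can diverge between rounds.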
Privacy in Machine Learning
Privacy-preserving learning has been extensively studied by the machine learning [e.g., 75], systems [e.g., 9, 3], and theory [e.g., 37, 68] communities. Three main strategies, each of which we will briefly review, include differential privacy to communicate noisy data sketches, homomorphic encryption to operate on encrypted data, and secure function evaluation or multiparty computation.
Among these various privacy approaches, differential privacy [31, 32, 33] is most widely used due to its strong information-theoretic guarantees, algorithmic simplicity, and relatively small systems overhead. Simply put, a randomized mechanism is differentially private if the change of one input element will not result in too much difference in the output distribution; this means that one cannot draw any conclusions about whether or not a specific sample is used in the learning process. Such sample-level privacy can be achieved in many learning tasks [21, 6, 1, 84, 85, 52]. For gradient-based learning methods, a popular approach is to apply differential privacy by randomly perturbing the intermediate output at each iteration [e.g., 1, 6, 126]. Before applying the perturbation, e.g., via Gaussian noise, Laplacian noise, or Binomial noise, it is common to clip the gradients in order to bound the influence of each example on the overall update. There exists an inherent trade-off between differential privacy and model accuracy, as adding more noise results in greater privacy, but may compromise accuracy significantly. Despite the fact that differential privacy is the de facto metric for privacy in machine learning, there are many other privacy definitions, such as $k$-anonymity, $\delta$-presence, and distance correlation, that may be applicable to different learning problems.
Beyond differential privacy, homomorphic encryption can be used to secure the learning process by computing on encrypted data, although it has currently been applied in limited settings, e.g., training linear models or involving only a few entities. When the sensitive datasets are distributed across different data owners, another natural option is to perform privacy-preserving learning via secure function evaluation (SFE) or secure multiparty computation (SMC). The resulting protocols can enable multiple parties to collaboratively compute an agreed-upon function without leaking input information from any party except for what can be inferred from the output [e.g., 42, 22, 94]. Thus, while SMC cannot guarantee protection from information leakage, it can be combined with differential privacy to achieve stronger privacy guarantees. However, approaches along these lines may not be applicable to large-scale machine learning scenarios as they incur substantial additional communication and computation costs. Moreover, SMC protocols need to be carefully designed and implemented for each operation in the targeted learning algorithm [24, 78]. We refer interested readers to [12, 96] for a more comprehensive review of the approaches based on homomorphic encryption and SMC.
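The clip-then-perturb recipe for gradient-based methods can be sketched as follows. This is a minimal illustration that assumes Gaussian noise and per-example L2 clipping; it does not track the resulting privacy budget, and the function names and the `noise_multiplier` parameter are hypothetical choices rather than part of any cited method.

```python
import math
import random


def clip(gradient, max_norm):
    """Scale a gradient down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in gradient]


def private_mean_gradient(per_example_gradients, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_gradients[0])
    clipped = [clip(g, max_norm) for g in per_example_gradients]
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    # Noise scale is tied to max_norm, the most any one example can contribute.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_gradients) for v in noisy]
```

Calling `private_mean_gradient(grads, 1.0, 1.1, random.Random(0))` returns a noisy average gradient. A larger `noise_multiplier` gives stronger privacy at the cost of a noisier update, which is the trade-off described in the paragraph above.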
Privacy in Federated Learning
The federated setting poses novel challenges to existing privacy-preserving algorithms. Beyond providing rigorous privacy guarantees, it is necessary to develop methods that are computationally cheap, communication-efficient, and tolerant to dropped devices—all without overly compromising accuracy. Although there are a variety of privacy definitions in federated learning [7, 16, 75, 113, 40, 63], typically they can be classified into two categories: global privacy and local privacy. As demonstrated in Figure 6, global privacy requires that the model updates generated at each round are private to all untrusted third parties other than the central server, while local privacy further requires that the updates are also private to the server.
Current works that aim to improve the privacy of federated learning typically build upon previous classical cryptographic protocols such as SMC [10, 41] and differential privacy [40, 75, 7, 2]. Bonawitz et al. introduce an SMC protocol to protect individual model updates. The central server is not able to see any local updates, but can still observe the exact aggregated results at each round. SMC is a lossless method, and can retain the original accuracy with a very high privacy guarantee. However, the resulting method incurs significant extra communication cost. Other works [75, 40] apply differential privacy to federated learning and offer global differential privacy. These approaches have a number of hyperparameters that affect communication and accuracy that must be carefully chosen, though follow-up work proposes adaptive gradient clipping strategies to help alleviate this issue. In the case where stronger privacy guarantees are required, Bhowmick et al. introduce a relaxed version of local privacy by limiting the power of potential adversaries. It affords stronger privacy guarantees than global privacy, and has better model performance than strict local privacy. Li et al. propose locally differentially-private algorithms in the context of meta-learning, which can be applied to federated learning with personalization, while also providing provable learning guarantees in convex settings. In addition, differential privacy can be combined with model compression techniques to reduce communication and obtain privacy benefits simultaneously.
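The property that the server sees only the exact aggregate, and not individual updates, can be illustrated with pairwise additive masks that cancel in the sum. The toy sketch below is not the cited protocol: it omits key agreement, encryption, and dropout handling, generates all masks in one place for brevity, and assumes updates have already been encoded as non-negative integers below a modulus. All names are hypothetical.

```python
import random

MODULUS = 2 ** 32  # Assumed size of the integer encoding, for illustration only.


def pairwise_masks(num_devices, dim, rng):
    """For each pair (i, j), device i adds a shared random vector and j subtracts it."""
    masks = [[0] * dim for _ in range(num_devices)]
    for i in range(num_devices):
        for j in range(i + 1, num_devices):
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for d in range(dim):
                masks[i][d] = (masks[i][d] + shared[d]) % MODULUS
                masks[j][d] = (masks[j][d] - shared[d]) % MODULUS
    return masks


def mask_update(update, mask):
    """What a device would send: its update hidden behind its combined mask."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates):
    """Server-side sum; the pairwise masks cancel, leaving the true total."""
    dim = len(masked_updates[0])
    return [sum(u[d] for u in masked_updates) % MODULUS for d in range(dim)]
```

Summing the masked vectors for updates `[1, 2]`, `[10, 20]`, and `[100, 200]` yields `[111, 222]`, even though each masked vector on its own looks random. Every pair of devices must share a mask, which hints at why such protocols add communication overhead.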
3 Future Directions
Federated learning is an active and ongoing area of research. Although recent work has begun to address the challenges discussed in Section 2, there are a number of critical open directions yet to be explored. In this section, we briefly outline a few promising research directions surrounding the previously discussed challenges (expensive communication, systems heterogeneity, statistical heterogeneity, and privacy concerns), and introduce additional challenges regarding issues such as productionizing and benchmarking in federated settings.
Extreme communication schemes. It remains to be seen how much communication is necessary in federated learning. Indeed, it is well known that optimization methods for machine learning can tolerate a lack of precision; this error can in fact help with generalization [129]. While one-shot or divide-and-conquer communication schemes have been explored in traditional data center settings [137, 72], the behavior of these methods is not well understood in massive or statistically heterogeneous networks. Similarly, one-shot/few-shot heuristics [134, 43, 44] have recently been proposed for the federated setting, but have yet to be theoretically analyzed or evaluated at scale.
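To see why one-shot schemes are delicate under statistical heterogeneity, consider a toy sketch in which each device fits a one-parameter least-squares model and the server averages the parameters in a single round. The model, the data, and the function names are assumptions made purely for illustration.

```python
def local_fit(xs, ys):
    """Least-squares slope for y ~ w * x using one device's data only."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def one_shot_average(device_data):
    """Single communication round: every device sends one fitted parameter
    and the server returns their unweighted mean."""
    local = [local_fit(xs, ys) for xs, ys in device_data]
    return sum(local) / len(local)


def pooled_fit(device_data):
    """Reference solution obtained by pooling all data centrally."""
    xs = [x for dx, _ in device_data for x in dx]
    ys = [y for _, dy in device_data for y in dy]
    return local_fit(xs, ys)


# Made-up data: device 0 follows y = 2x, device 1 follows y = 4x.
devices = [([1.0, 2.0], [2.0, 4.0]), ([1.0], [4.0])]
print(one_shot_average(devices))  # 3.0
print(pooled_fit(devices))
```

With heterogeneous devices the one-shot average (3.0) differs from the pooled solution (14/6, roughly 2.33), which is one small instance of the gap that remains to be characterized theoretically.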
Communication reduction and the Pareto frontier. We discussed several ways to reduce communication in federated training, such as local updating and model compression. In order to create a realistic system for federated learning, it is important to understand how these techniques compose with one another, and to systematically analyze the trade-off between accuracy and communication for each approach. In particular, the most useful techniques will demonstrate improvements at the Pareto frontier—achieving an accuracy greater than any other approach under the same communication budget, and ideally, across a wide range of communication/accuracy profiles. Similar comprehensive analyses have been performed for efficient neural network inference [e.g., 8], and are necessary in order to compare communication-reduction techniques for federated learning in a meaningful way.
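As a minimal sketch of the comparison suggested here, the function below extracts the Pareto frontier from a set of (communication cost, accuracy) measurements. The method names and numbers are made up for illustration and do not correspond to any reported result.

```python
def pareto_frontier(results):
    """Return the non-dominated entries of `results`, sorted by cost.

    `results` is a list of (name, communication_cost, accuracy) tuples.
    An entry is kept only if its accuracy is strictly higher than that of
    every entry with lower or equal communication cost.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost; among equal costs, consider the most accurate first.
    for name, cost, accuracy in sorted(results, key=lambda r: (r[1], -r[2])):
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Hypothetical measurements: (method, megabytes communicated, test accuracy).
measurements = [
    ("A", 10, 0.70),
    ("B", 20, 0.68),
    ("C", 20, 0.80),
    ("D", 50, 0.80),
    ("E", 80, 0.85),
]
print([name for name, _, _ in pareto_frontier(measurements)])  # ['A', 'C', 'E']
```

Methods B and D are dominated: C reaches higher accuracy than B at the same cost, and matches D at lower cost. A technique is useful in the sense of the text only if it adds points to this frontier.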
Novel models of asynchrony. As discussed in Section 2.2.1, the two communication schemes most commonly studied in distributed optimization are bulk synchronous approaches and asynchronous approaches (where the delay is assumed to be bounded). These schemes are more realistic in data center settings, where worker nodes are typically dedicated to the workload, i.e., they are ready to ‘pull’ their next job from the central node immediately after they ‘push’ the results of their previous job. In contrast, in federated networks, each device is often not dedicated to the task at hand, and most devices are not active on any given iteration. Therefore, it is worth studying the effects of a more realistic device-centric communication scheme, in which each device can decide when to ‘wake up’ and interact with the central server in an event-triggered manner.
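A device-centric scheme of this kind can be explored with a small discrete-event simulation. The sketch below is only an assumption about what such a model might look like: devices wake at random times, pull the current model version, train for a random duration, and push; the server applies each update on arrival. The timing distributions and the name `simulate` are hypothetical.

```python
import heapq
import random


def simulate(num_devices, horizon, rng):
    """Simulate event-triggered device participation up to time `horizon`.

    Returns a list of (time, device, staleness) tuples, where staleness is
    the number of server updates applied between a device's pull and push.
    """
    events = []  # heap of (time, device, kind, version_pulled)
    for device in range(num_devices):
        heapq.heappush(events, (rng.expovariate(1.0), device, "wake", -1))
    version = 0  # number of updates the server has applied so far
    log = []
    while events:
        time, device, kind, pulled = heapq.heappop(events)
        if time > horizon:
            break
        if kind == "wake":
            # The device pulls the current model and trains locally.
            done = time + rng.uniform(0.1, 2.0)
            heapq.heappush(events, (done, device, "push", version))
        else:
            # The server applies the update immediately, however stale.
            log.append((time, device, version - pulled))
            version += 1
            # The device goes idle and decides by itself when to return.
            heapq.heappush(events, (time + rng.expovariate(0.5), device, "wake", -1))
    return log


log = simulate(num_devices=5, horizon=50.0, rng=random.Random(0))
print(max(staleness for _, _, staleness in log))
```

Nothing in this model bounds the staleness in advance, in contrast with the bounded-delay assumption of classical asynchronous analyses, which is what makes the setting an open direction.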
Heterogeneity diagnostics. Recent works have aimed to quantify statistical heterogeneity through metrics such as local dissimilarity (as defined in the context of federated learning in [64] and used for other purposes in works such as [130, 99, 116]) and earth mover’s distance [139]. However, these metrics cannot be easily calculated over the federated network before training occurs. The importance of these metrics motivates the following open questions: (i) Do simple diagnostics exist to quickly determine the level of heterogeneity in federated networks a priori? (ii) Can analogous diagnostics be developed to quantify the amount of systems-related heterogeneity? (iii) Can current or new definitions of heterogeneity be exploited to further improve the convergence of federated optimization methods?
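For concreteness, the two kinds of metric can be sketched as follows. The first function assumes the common form of local dissimilarity, B(w) = sqrt(E_k ||∇F_k(w)||² / ||∇f(w)||²), with ∇f(w) the weighted average of the device gradients ∇F_k(w). The second assumes that the distance between label distributions is measured as an L1 distance between normalized label histograms. Both formulas and all names are stated as assumptions of this sketch rather than as the exact definitions of the cited works.

```python
def local_dissimilarity(device_grads, weights):
    """B(w) computed from per-device gradients at a common model w.

    `device_grads` is a list of gradient vectors, one per device, and
    `weights` are the device masses p_k (assumed to sum to one).
    B(w) equals 1 when all device gradients agree and grows as they diverge.
    """
    dim = len(device_grads[0])
    global_grad = [
        sum(p * g[i] for p, g in zip(weights, device_grads)) for i in range(dim)
    ]
    mean_sq_norm = sum(p * sum(x * x for x in g) for p, g in zip(weights, device_grads))
    global_sq_norm = sum(x * x for x in global_grad)
    return (mean_sq_norm / global_sq_norm) ** 0.5


def label_distribution_distance(device_counts, population_counts):
    """L1 distance between a device's label histogram and the population's."""
    device_total = sum(device_counts)
    population_total = sum(population_counts)
    return sum(
        abs(d / device_total - p / population_total)
        for d, p in zip(device_counts, population_counts)
    )


print(local_dissimilarity([[1.0, 2.0], [1.0, 2.0]], [0.5, 0.5]))  # 1.0
print(label_distribution_distance([10, 0], [5, 5]))  # 1.0
```

Both quantities need information that is only available once devices participate (gradients at a shared model, or label counts from every device), which illustrates why they are hard to evaluate before training.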
Granular privacy constraints. The definitions of privacy outlined in Section 2.4.2 cover privacy at a local or global level with respect to all devices in the network. However, in practice, it may be necessary to define privacy on a more granular level, as privacy constraints may differ across devices or even across data points on a single device. For instance, Li et al. [63] recently proposed sample-specific (as opposed to user-specific) privacy guarantees, thus providing a weaker form of privacy in exchange for more accurate models. Developing methods to handle mixed (device-specific or sample-specific) privacy restrictions is an interesting and ongoing direction of future work.
Beyond supervised learning. It is important to note that the methods discussed thus far have been developed with the task of supervised learning in mind, i.e., they assume that labels exist for all of the data in the federated network. In practice, much of the data generated in realistic federated networks may be unlabeled or weakly labeled. Furthermore, the problem at hand may not be to fit a model to data as presented in (1), but instead to perform some exploratory data analysis, determine aggregate statistics, or run a more complex task such as reinforcement learning. Tackling problems beyond supervised learning in federated networks will likely require addressing similar challenges of scalability, heterogeneity, and privacy.
Productionizing federated learning. Beyond the major challenges discussed in this article, there are a number of practical concerns that arise when running federated learning in production. In particular, issues such as concept drift (when the underlying data-generation model changes over time); diurnal variations (when the devices exhibit different behavior at different times of the day or week) [34]; and cold start problems (when new devices enter the network) must be handled with care. We refer readers to [9], which discusses some of the practical systems-related issues that exist in production federated learning systems.
Benchmarks. Finally, as federated learning is a nascent field, we are at a pivotal time to shape the developments made in this area and ensure that they are grounded in real-world settings, assumptions, and datasets. It is critical for the broader research communities to further build upon existing implementations and benchmarking tools, such as LEAF [15] and TensorFlow Federated [112], to facilitate both the reproducibility of empirical results and the dissemination of new solutions for federated learning.
4 Conclusion
In this article, we have provided an overview of federated learning, a learning paradigm where statistical models are trained at the edge in distributed networks. We have discussed the unique properties and associated challenges of federated learning compared with traditional distributed data center computing and classical privacy-preserving learning. We provided an extensive survey of classical results as well as more recent work specifically focused on federated settings. Finally, we have outlined a handful of open problems worth future research effort. Providing solutions to these problems will require interdisciplinary effort from a broad set of research communities.
Acknowledgement. We thank Jeffrey Li and Mikhail Khodak for helpful discussions and comments. This work was supported in part by DARPA FA875017C0141, the National Science Foundation grants IIS1705121 and IIS1838017, an Okawa Grant, a Google Faculty Award, an Amazon Web Services Award, a JP Morgan A.I. Research Faculty Award, a Carnegie Bosch Institute Research Award and the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.
- We use the term ‘device’ throughout the article to describe entities in the network, such as nodes, clients, sensors, or organizations.
- (2016) Deep learning with differential privacy. In Conference on Computer and Communications Security, Cited by: §2.4.1.
- (2018) CpSGD: communication-efficient and differentially-private distributed sgd. In Advances in Neural Information Processing Systems, Cited by: §2.4.1, §2.4.2.
- (2000) Privacy-preserving data mining. In International Conference on Management of Data, Cited by: §2.4.1.
- (2019) Federated collaborative filtering for privacy-preserving personalized recommendation system. arXiv preprint arXiv:1901.09888. Cited by: §1.
- (2013) A public domain dataset for human activity recognition using smartphones.. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Cited by: §1.
- (2014) Private empirical risk minimization: efficient algorithms and tight error bounds. In Foundations of Computer Science, Cited by: §2.4.1.
- (2018) Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984. Cited by: §2.4.2, §2.4.2, §2.4.
- (2017) Adaptive neural networks for efficient inference. In International Conference on Machine Learning, Cited by: 2nd item.
- (2019) Towards federated learning at scale: system design. In Conference on Systems and Machine Learning, Cited by: §1.2, §1, §2.2.2, §2.2.3, §2.4.1, 7th item.
- (2017) Practical secure aggregation for privacy-preserving machine learning. In Conference on Computer and Communications Security, Cited by: §2.4.2.
- (2012) Fog computing and its role in the internet of things. In SIGCOMM Workshop on Mobile Cloud Computing, Cited by: §1.
- (2015) Machine learning classification over encrypted data.. In Network and Distributed System Security Symposium, Cited by: §2.4.1.
- (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3, pp. 1–122. Cited by: §2.1.1.
- (2018) Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210. Cited by: §2.1.2.
- (2018) LEAF: a benchmark for federated settings. arXiv preprint arXiv:1812.01097. Cited by: 8th item.
- (2018) The secret sharer: measuring unintended neural network memorization & extracting secrets. arXiv preprint arXiv:1802.08232. Cited by: §1.2, §2.4.2, §2.4.
- (1997) Multitask learning. Machine Learning 28, pp. 41–75. Cited by: §2.3.1.
- (1999) Practical byzantine fault tolerance. In Operating Systems Design and Implementation, Cited by: §2.2.3.
- (2017) Approximate gradient coding via sparse random graphs. arXiv preprint arXiv:1711.0677. Cited by: §2.2.3.
- (2018) Gradient coding using the stochastic block model. In International Symposium on Information Theory, Cited by: §2.2.3.
- (2011) Differentially private empirical risk minimization. Journal of Machine Learning Research 12, pp. 1069–1109. Cited by: §2.4.1.
- (1988) The dining cryptographers problem: unconditional sender and recipient untraceability. Journal of Cryptology 1, pp. 65–75. Cited by: §2.4.1.
- (2018) Federated meta-learning for recommendation. arXiv preprint arXiv:1802.07876. Cited by: §2.3.1.
- (2019) Secure computation for machine learning with spdz. arXiv preprint arXiv:1901.00329. Cited by: §2.4.1.
- (2019) Variational federated multi-task learning. arXiv preprint arXiv:1906.06268. Cited by: §2.3.1.
- (2015) High-performance distributed ML at scale through parameter server consistency models. In AAAI Conference on Artificial Intelligence, Cited by: §2.2.1.
- (2012) Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research 13, pp. 165–202. Cited by: §2.1.1.
- (2005) Model-based approximate querying in sensor networks. The VLDB Journal 14, pp. 417–443. Cited by: §1.
- (2012) Privacy aware learning. In Advances in Neural Information Processing Systems, Cited by: §1.2.
- (2013) Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems, Cited by: §2.2.1.
- (2006) Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, Cited by: §2.4.1.
- (2014) The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, pp. 211–407. Cited by: §1.2, §2.4.1.
- (2011) A firm foundation for private data analysis. Communications of the ACM 54, pp. 86–95. Cited by: §2.4.1.
- (2019) Semi-cyclic stochastic gradient descent. In International Conference on Machine Learning, Cited by: §2.3.1, 7th item.
- (2008) Protecting privacy using k-anonymity. Journal of the American Medical Informatics Association 15, pp. 627–637. Cited by: §2.4.1.
- (2004) Regularized multi–task learning. In Conference on Knowledge Discovery and Data Mining, Cited by: §2.3.1.
- (2018) Privacy amplification by iteration. In Foundations of Computer Science, Cited by: §2.4.1.
- (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Conference on Computer and Communications Security, Cited by: §2.4.
- (2015) Edge-centric computing: vision and challenges. SIGCOMM Computer Communication Review 45, pp. 37–42. Cited by: §1.
- (2017) Differentially private federated learning: a client level perspective. arXiv preprint arXiv:1712.07557. Cited by: §2.4.2, §2.4.2.
- (2019) Scalable and differentially private distributed aggregation in the shuffled model. arXiv preprint arXiv:1906.08320. Cited by: §2.4.2.
- (2015) A comprehensive comparison of multiparty secure additions with differential privacy. IEEE Transactions on Dependable and Secure Computing 14, pp. 463–477. Cited by: §2.4.1.
- (2018) Model aggregation via good-enough model spaces. arXiv preprint arXiv:1805.07782. Cited by: 1st item.
- (2019) One-shot federated learning. arXiv preprint arXiv:1902.11175. Cited by: 1st item.
- (2018) Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604. Cited by: 1st item, §1.
- (2018) Cola: decentralized linear learning. In Advances in Neural Information Processing Systems, Cited by: §2.1.3, §2.2.2.
- (2013) More effective distributed ML via a stale synchronous parallel parameter server. In Advances in Neural Information Processing Systems, Cited by: §2.2.1.
- (2013) Mobile fog: a programming model for large-scale applications on the internet of things. In SIGCOMM Workshop on Mobile Cloud Computing, Cited by: §1.
- (2013) An in-depth study of lte: effect of network protocol and application behavior on performance. SIGCOMM Computer Communication Review 43, pp. 363–374. Cited by: §1.2.
- (2019) Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. arXiv preprint arXiv:1903.09296. Cited by: §1.
- (2018) LoAdaBoost: loss-based adaboost federated machine learning on medical data. arXiv preprint arXiv:1811.12629. Cited by: 2nd item, §1, §2.3.1, §2.3.2.
- (2019) Towards practical differentially private convex optimization. In Conference on Computer and Communications Security, Cited by: §2.4.1.
- (2014) Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, Cited by: §2.1.1.
- (2018) Communication-efficient on-device machine learning: federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479. Cited by: §2.3.2.
- (2018) A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems, Cited by: §2.2.3, §2.3.2.
- (2019) Incentive design for efficient federated learning in mobile networks: a contract theory approach. arXiv preprint arXiv:1905.07479. Cited by: §2.2.2.
- (2019) Adaptive gradient-based meta-learning methods. arXiv preprint arXiv:1906.02717. Cited by: §2.3.1.
- (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492. Cited by: §2.1.2.
- (2012) Challenges and solutions of ubiquitous user modeling. In Ubiquitous Display Environments, Cited by: §1.
- (2019) Decentralized bayesian learning over graphs. arXiv preprint arXiv:1905.10466. Cited by: §2.1.3.
- (2015) Distributed box-constrained quadratic optimization for dual linear svm. In International Conference on Machine Learning, Cited by: §2.1.1.
- (2017) Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory 64, pp. 1514–1529. Cited by: §2.2.3.
- (2019) Differentially-private gradient-based meta-learning. Technical Report. Cited by: §1.2, §2.4.2, §2.4.2, 5th item.
- (2018) Federated optimization for heterogeneous networks. arXiv preprint arXiv:1812.06127. Cited by: §2.1.1, §2.2.2, §2.2.3, §2.3.2, 4th item.
- (2019) Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497. Cited by: §2.3.1.
- (2017) Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, Cited by: §2.1.3.
- (2018) Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217. Cited by: §2.1.3, §2.3.2.
- (2000) Privacy preserving data mining. In Advances in Cryptology, Cited by: §2.4.1.
- (2019) Edge-assisted hierarchical federated learning with non-iid data. arXiv preprint arXiv:1905.06641. Cited by: §2.1.3.
- (2013) Data center networks: topologies, architectures and fault-tolerance characteristics. Springer Science & Business Media. Cited by: §2.2.3.
- (2015) Adding vs. averaging in distributed primal-dual optimization. In International Conference on Machine Learning, Cited by: §2.1.1.
- (2011) Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems, Cited by: 1st item.
- (2005) TinyDB: an acquisitional query processing system for sensor networks. Transactions on Database Systems 30, pp. 122–173. Cited by: §1.
- (2017) Communication-efficient learning of deep networks from decentralized data. In Conference on Artificial Intelligence and Statistics, Cited by: §1, §2.1.1, §2.2.2, §2.3.2.
- (2018) Learning differentially private recurrent language models. In International Conference on Learning Representations, Cited by: §1.2, §2.4.1, §2.4.2, §2.4.2.
- (2016) Efficient private statistics with succinct sketches. In Network and Distributed System Security Symposium, Cited by: §2.4.1.
- (2019) Exploiting unintended feature leakage in collaborative learning. In IEEE Symposium on Security & Privacy, Cited by: §2.4.
- (2018) ABY 3: a mixed protocol framework for machine learning. In Conference on Computer and Communications Security, Cited by: §2.4.1.
- (2019) Agnostic federated learning. In International Conference on Machine Learning, Cited by: §2.3.1.
- (2010) δ-Presence without complete world knowledge. IEEE Transactions on Knowledge and Data Engineering 22, pp. 868–883. Cited by: §2.4.1.
- (2013) Privacy-preserving ridge regression on hundreds of millions of records. In Symposium on Security and Privacy, Cited by: §2.4.1.
- (2019) Client selection for federated learning with heterogeneous resources in mobile edge. In International Conference on Communications, Cited by: §2.2.2.
- (2010) A survey on wearable sensor-based systems for health monitoring and prognosis. IEEE Transactions on Systems, Man, and Cybernetics 40, pp. 1–12. Cited by: 3rd item, §1.
- (2017) Semi-supervised knowledge transfer for deep learning from private training data. In International Conference on Learning Representations, Cited by: §2.4.1.
- (2018) Scalable private learning with pate. In International Conference on Learning Representations, Cited by: §2.4.1.
- (2019) Fault tolerance in iterative-convergent machine learning. In International Conference on Machine Learning, Cited by: §2.2.3.
- (2015) Quartz: randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems, Cited by: §2.1.1.
- (2019) Federated learning for emoji prediction in a mobile keyboard. arXiv preprint arXiv:1906.04329. Cited by: 1st item, §1.
- (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, Cited by: §1.
- (2019) SysML: the new frontier of machine learning systems. arXiv preprint arXiv:1904.03257. Cited by: §1.
- (2011) Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, Cited by: §2.2.1.
- (2016) AIDE: fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879. Cited by: §2.1.1, §2.3.2.
- (2019) Coded computation over heterogeneous clusters. IEEE Transactions on Information Theory 65, pp. 4227–4242. Cited by: §2.2.3.
- (2018) Chameleon: a hybrid secure computation framework for machine learning applications. In Asia Conference on Computer and Communications Security, Cited by: §2.4.1.
- (2016) Distributed coordinate descent method for learning with big data. Journal of Machine Learning Research 17, pp. 2657–2681. Cited by: §2.1.1.
- (2018) Deepsecure: scalable provably-secure deep learning. In Design Automation Conference, Cited by: §2.4.1.
- (2018) Federated learning for ultra-reliable low-latency v2v communications. In Global Communications Conference, Cited by: 3rd item.
- (2019) Robust and communication-efficient federated learning from non-iid data. arXiv preprint arXiv:1903.02891. Cited by: §2.1.2.
- (2013) Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370. Cited by: 4th item.
- (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In International Speech Communication Association, Cited by: §2.1.2.
- (2013) Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, Cited by: §2.1.1.
- (2014) Distributed stochastic optimization and learning. In Allerton Conference on Communication, Control, and Computing, Cited by: §2.1.1.
- (2014) Communication-efficient distributed optimization using an approximate newton-type method. In International Conference on Machine Learning, Cited by: §2.3.2.
- (2018) Federated learning in distributed medical databases: meta-analysis of large-scale subcortical brain data. arXiv preprint arXiv:1810.08553. Cited by: §1.
- (2017) Federated multi-task learning. In Advances in Neural Information Processing Systems, Cited by: §1.2, §2.1.1, §2.2.2, §2.2.3, §2.3.1.
- (2018) CoCoA: a general framework for communication-efficient distributed optimization. Journal of Machine Learning Research 18, pp. 1–47. Cited by: §2.1.1.
- (2019) Local sgd converges fast and communicates little. In International Conference on Learning Representations, Cited by: §2.1.1, §2.3.2.
- (2017) Gradient coding: avoiding stragglers in distributed learning. In International Conference on Machine Learning, Cited by: §2.2.3.
- (2007) Distributed systems: principles and paradigms. Prentice-Hall. Cited by: §2.2.3.
- (2018) Communication compression for decentralized training. In Advances in Neural Information Processing Systems, Cited by: §2.1.2.
- (2019) Distributed learning over unreliable networks. In International Conference on Machine Learning, Cited by: §2.2.3.
- (Website) TensorFlow Federated. Cited by: 8th item.
- (2019) Differentially private learning with adaptive clipping. arXiv preprint arXiv:1905.03871. Cited by: §2.4.2, §2.4.2.
- (2012) Learning to learn. Springer Science & Business Media. Cited by: §2.3.1.
- (2009) Multi-core for mobile phones. In Conference on Design, Automation and Test in Europe, Cited by: §1.2.
- (2019) Fast and faster convergence of sgd for over-parameterized models (and an accelerated perceptron). In Conference on Artificial Intelligence and Statistics, Cited by: 4th item.
- (2019) Reducing leakage in distributed deep learning for sensitive health data. arXiv preprint arXiv:1812.00564. Cited by: §2.4.1.
- (2018) Technical privacy metrics: a systematic survey. ACM Computing Surveys 51, pp. 57. Cited by: §2.4.1.
- (2018) Atomo: communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, Cited by: §2.1.2.
- (2018) Cooperative sgd: a unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576. Cited by: §2.3.2.
- (2019) Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd. In Conference on Systems and Machine Learning, Cited by: §2.3.2.
- (2019) Adaptive federated learning in resource constrained edge computing systems. Journal on Selected Areas in Communications 37, pp. 1205–1221. Cited by: §2.2.3, §2.3.2.
- (2018) Giant: globally improved approximate newton method for distributed optimization. In Advances in Neural Information Processing Systems, Cited by: §2.3.2.
- (2018) Federated learning white paper v1.0. Cited by: §1.
- (2018) Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Advances in Neural Information Processing Systems, Cited by: §2.3.2.
- (2017) Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In International Conference on Management of Data, Cited by: §2.4.1.
- (2019) Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology 10, pp. 12. Cited by: §1.
- (2013) Trading computation for communication: distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, Cited by: §2.1.1.
- (2007) On early stopping in gradient descent learning. Constructive Approximation 26, pp. 289–315. Cited by: 1st item.
- (2018) Gradient diversity: a key ingredient for scalable distributed learning. In Conference on Artificial Intelligence and Statistics, pp. 1998–2007. Cited by: 4th item.
- (2019) On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization. In International Conference on Machine Learning, Cited by: §2.2.3, §2.3.2.
- (2018) Parallel restarted sgd for non-convex optimization with faster convergence and less communication. In AAAI Conference on Artificial Intelligence, Cited by: §2.2.3, §2.3.2.
- (2013) Privacy preserving back-propagation neural network learning made practical with cloud computing. IEEE Transactions on Parallel and Distributed Systems 25, pp. 212–221. Cited by: §2.4.1.
- (2019) Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, Cited by: 1st item.
- (2017) ZipML: training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, Cited by: §2.1.2.
- (2015) Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, Cited by: §2.1.1, §2.3.2.
- (2015) Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. Journal of Machine Learning Research 16, pp. 3299–3340. Cited by: 1st item.
- (2019) Mobile edge computing, blockchain and reputation-based crowdsourcing iot federated learning: a secure, decentralized and privacy-preserving system. arXiv preprint arXiv:1906.10893. Cited by: §1.
- (2018) Federated learning with non-iid data. arXiv preprint arXiv:1806.00582. Cited by: §2.3.1, §2.3.2, 4th item.
- (2018) On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. In International Joint Conference on Artificial Intelligence, Cited by: §2.3.2.
- (2010) Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, Cited by: §2.2.1.