On the Communication Latency of Wireless Decentralized Learning
We consider a wireless network comprising nodes located within a circular area of radius , which are participating in a decentralized learning algorithm to optimize a global objective function using their local datasets. To enable gradient exchanges across the network, we assume each node communicates only with a set of neighboring nodes, which are within a distance of itself, where . We use tools from network information theory and random geometric graph theory to show that the communication delay for a single round of exchanging gradients on all the links throughout the network scales as , increasing (at different rates) with both the number of nodes and the gradient exchange threshold distance.
With the advent of novel powerful computing platforms, alongside the availability of large-scale datasets, machine learning (ML), and particularly, deep learning, have gained significant interest in recent years . Such developments have also contributed to the invention of more advanced ML architectures and more efficient training mechanisms , which have resulted in state-of-the-art performance in many domains, such as computer vision , natural language processing , health-care , etc.
More recently, however, consumers of ML-driven services have become increasingly aware of the privacy of their data. Depending on how sensitive the data type is or how often it is collected, each user has their own privacy concerns and preferences . Such trends have coincided with the proliferation of mobile computing solutions, which equip devices such as smart-home devices, cell phones, laptops, and drones with strong computation capabilities .
These societal and technical trends have given rise to paradigms such as federated and decentralized learning, where the data generated by each device stays on the device to protect its privacy [8, 9]. To compensate for that, (part of) the computation is also shifted to the end-user devices. It has been shown that in many cases, distributing the learning process over different nodes incurs negligible performance loss compared to centralized training approaches .
However, one major bottleneck in all the aforementioned paradigms is the communication network between the learning nodes. As the data points generated by each node differ from those of the rest of the network, the nodes need to periodically communicate with each other so that they all converge to the same model, rather than diverging to completely different models. If the communication between the nodes induces sizeable delays, it can significantly lengthen the convergence time across the network, as it can dominate the computation delay at the learning nodes.
This phenomenon has motivated a massive body of recent work on dealing with the communication delays for federated and decentralized learning. In , a setting with a single server and multiple worker nodes is considered, where at each iteration, a subset of worker nodes are selected, either by the server or by the worker nodes themselves, to send their gradients to the server. In , a simple network of multiple worker nodes is considered, over which they can all exchange their computation results with a fixed amount of delay, in conjunction with a server which aggregates all the results and sends back updated parameters to the worker nodes. In , gossiping algorithms and convergence guarantees are provided for decentralized optimization with compressed communication. In , it is shown how specific connectivity of the communication network topology among learning nodes affects the speed of convergence. In , convergence results are derived for a combination of quantization, sparsification and local computation in a distributed computation setting with a single master and multiple worker nodes. In , a deadline-based approach for minibatch gradient computing at each computing node is proposed, such that the minibatch size is adaptive to the computation capabilities of each node, hence making the scheme robust to stragglers.
Most of the above works deal with an abstract model for the communication network among the learning nodes. One particularly interesting communication paradigm to consider is wireless communication, especially as operators around the world roll out their 5G network infrastructure. There have been some recent works that have considered wireless constraints, mostly in the context of federated learning [17, 18, 19, 20, 21].
In this paper, we consider the decentralized learning scenario over a network of learning nodes connected through a shared wireless medium. Given the nature of wireless networks, in which nodes in proximity can communicate more efficiently with each other but also interfere with each other during concurrent transmissions, we attempt to characterize the communication delay for exchanging the gradients among the learning nodes over the wireless network topology. In particular, we consider a setting similar to , where at each time, a set of non-interfering gradient exchanges are scheduled to happen simultaneously. Using the results on the optimality of treating interference as noise in interference networks , we present an algorithm for gradient exchanges in wireless decentralized learning akin to the information-theoretic link scheduling that was proposed in  for the case of device-to-device networks.
We utilize tools from random geometric graph theory to characterize the asymptotic communication latency for exchanging gradients in the aforementioned decentralized setting. In particular, we consider a network of learning nodes located within a circle of radius , where each node exchanges gradients with its neighboring nodes, which are within a distance of itself, where is a variable that controls the density of the gradient exchange topology. This threshold distance needs to decrease with , as the entire network needs to remain connected to guarantee the convergence of the decentralized learning algorithm. We show that as , the communication latency scales as , increasing with the number of nodes, and decreasing with . This result provides insights on how much communication time is needed in a wireless decentralized learning scenario, where more gradient exchanges lead to longer communication latencies, but faster convergence rates.
II System Model
Consider a wireless network consisting of nodes dropped uniformly at random within a circular area of radius . Assume that each node has access to a set of data points , and the goal is to minimize a global loss function , defined over a set of optimization parameters , using the overall dataset across the network as
where is the local loss function at node , and is the stochastic loss function for sample given model parameters . In order to solve this problem, decentralized stochastic gradient descent (SGD) can be utilized to minimize the objective function in an iterative fashion. In decentralized SGD, the system is run over multiple iterations, where at each iteration, each node performs a local computation of the gradient of the objective function with respect to the set of optimization parameters over (a minibatch of) its local dataset, following which the gradients are exchanged among nodes prior to the beginning of the next iteration.
Due to the path-loss and fading effects in wireless communications, nodes can more easily communicate to their closer neighbors than farther ones. Therefore, we define the communication graph as the network topology which dictates how nodes exchange gradients with their neighboring nodes, and we model it as an undirected random geometric graph (RGG) , where is the set of all nodes in the network, and for every , where , if and only if , where denotes the distance between nodes and , and is the threshold distance for gradient exchange; i.e., two nodes can exchange their gradients with each other if and only if they are located within a distance of at most .
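The construction of the communication graph can be illustrated with a minimal sketch. The function names `sample_disk` and `communication_graph`, as well as the numerical values in the demo (200 nodes, unit radius, threshold 0.2), are assumptions chosen for illustration only and are not taken from the paper.

```python
import math
import random


def sample_disk(n, radius, rng):
    """Drop n points uniformly at random in a disk of the given radius."""
    points = []
    for _ in range(n):
        # The square root on the radial draw makes the density uniform over the area.
        rho = radius * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        points.append((rho * math.cos(theta), rho * math.sin(theta)))
    return points


def communication_graph(points, threshold):
    """Return undirected edges (i, j), i < j, for node pairs at distance <= threshold."""
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= threshold:
                edges.append((i, j))
    return edges


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the drop is reproducible
    pts = sample_disk(200, 1.0, rng)  # illustrative values, not from the paper
    edges = communication_graph(pts, 0.2)
    print(len(pts), "nodes,", len(edges), "gradient-exchange links")
```

The pairwise loop is quadratic in the number of nodes, which is adequate for small experiments; a spatial grid or k-d tree would be a natural replacement for larger networks.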
However, activating multiple gradient exchanges over the wireless channel at the same time will lead to interference, which can significantly reduce the network performance in terms of the throughput, and therefore, the communication delay. To capture the interference among concurrent wireless transmissions, we also define a conflict graph . In this graph, each vertex represents a communication link in the original communication graph, i.e., . Moreover, there is an edge between two vertices in if their activations are in conflict; i.e., if simultaneous transmissions of data (i.e., gradients) on those links strongly interfere with each other. Since the level of interference also depends on the distance of transmitting/receiving nodes, we introduce a conflict distance , where for two vertices , there is an edge between and , i.e., , if and only if
which implies that at least one node in is within conflict distance of . Note that for the case of , , implying that there is a conflict between and , for any two neighbors of node in the original communication graph. This means that a node cannot communicate with two nodes at the same time (i.e., half-duplex and single frequency band constraints).
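The conflict rule described above can be sketched as follows. This is one reading of the verbal description (two links conflict when some endpoint of one lies within the conflict distance of some endpoint of the other); the function name `conflict_graph`, the use of a non-strict inequality, and the toy coordinates are assumptions for illustration.

```python
import math
from itertools import combinations


def conflict_graph(points, links, conflict_distance):
    """Return pairs (a, b) of link indices whose endpoints come within conflict_distance."""
    conflicts = []
    for a, b in combinations(range(len(links)), 2):
        # Closest pair of endpoints, one taken from each link.
        closest = min(
            math.dist(points[u], points[v])
            for u in links[a]
            for v in links[b]
        )
        if closest <= conflict_distance:
            conflicts.append((a, b))
    return conflicts


if __name__ == "__main__":
    # Toy layout on a line: links 0 and 1 share node 1, link 2 is far away.
    points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
    links = [(0, 1), (1, 2), (3, 4)]
    print(conflict_graph(points, links, 1.5))
```

Two links that share a node have endpoint distance zero, so under this reading they conflict for any non-negative conflict distance, which matches the half-duplex remark in the text.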
Given the above definitions, our goal is to determine the asymptotic behavior of the normalized gradient exchange latency (as ), which is defined as the delay for completing the exchange of 1 bit of gradients on all the links of the communication graph. Assuming that the communication delay in the network dominates the gradient computation delay at each node, the normalized gradient exchange latency characterizes the wall-clock run time per iteration for decentralized SGD on a wireless communication network of learning nodes.
II-A Wireless Communication Model
We assume each node is equipped with a single transmit/receive antenna, and all transmissions happen in a synchronous time-slotted manner on a single frequency band. We restrict the transmission strategies to an on/off pattern: At each time slot, a node either transmits a message to another node with full power or stays completely silent. We use as a transmission status indicator of node at time slot ; i.e., if and only if node is transmitting with full power at time slot . On the receiver side, we adopt the simple and practical scheme of treating interference as noise (TIN), where each node decodes its desired message, while treating the interference from all other concurrent transmissions as noise. Letting denote the noise variance, the rate achieved on a link from node to node at time can be written as
where denotes the channel gain on the link between nodes and . In this paper, we adopt a single-slope path-loss model for the channel gains, where the channel gain at distance can be written as
where is the reference channel gain at a distance of , and denotes the path-loss exponent. This implies that the achievable rate in (1) can be written as
where denotes the signal-to-noise ratio (SNR) at a distance of .
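As a rough numerical illustration of treating interference as noise under a single-slope path-loss model, the sketch below computes the rate on one link given the set of concurrently active transmitters. The function names, the normalization by a reference distance `d_ref`, and the use of a base-2 logarithm (rates in bits per channel use) are assumptions; the paper's own rate expression is not reproduced here.

```python
import math


def received_snr(distance, snr_ref, alpha, d_ref=1.0):
    """Single-slope path loss: SNR at `distance`, given SNR snr_ref at distance d_ref."""
    return snr_ref * (distance / d_ref) ** (-alpha)


def tin_rate(points, tx, rx, active_tx, snr_ref, alpha, d_ref=1.0):
    """Rate on link tx -> rx when all nodes in active_tx transmit at full power.

    Interference from every other active transmitter is added to the noise term.
    """
    signal = received_snr(math.dist(points[tx], points[rx]), snr_ref, alpha, d_ref)
    interference = sum(
        received_snr(math.dist(points[k], points[rx]), snr_ref, alpha, d_ref)
        for k in active_tx
        if k != tx and k != rx
    )
    # Signal-to-interference-plus-noise ratio, with the noise power normalized to 1.
    return math.log2(1.0 + signal / (1.0 + interference))


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
    print(tin_rate(pts, 0, 1, [0], snr_ref=3.0, alpha=2.0))     # no interferer
    print(tin_rate(pts, 0, 1, [0, 2], snr_ref=3.0, alpha=2.0))  # node 2 interferes
```

Adding node 2 to the active set lowers the rate on the link from node 0 to node 1, which is the effect the conflict graph is meant to control.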
III Forming the Communication and Conflict Graphs
The communication network topology needs to be carefully designed, as decentralized SGD will not converge if the gradient exchange communication graph is disconnected . We resort to the following lemma, which provides a sufficient condition for connectivity of random geometric graphs.
Lemma 1 (Corollary 3.1 in ).
In an RGG with nodes and a threshold distance of , the graph is connected with probability one (as ) if , where .
In light of Lemma 1, for the communication graph, we set the gradient exchange threshold distance as
which decreases as the number of nodes increases so as to satisfy the condition in Lemma 1, hence maintaining the connectivity of the entire graph.
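For a finite network, connectivity of a given communication graph can be checked directly rather than relying on the asymptotic condition. The following is a minimal breadth-first-search sketch; the function name `is_connected` and the edge-list representation are assumptions for illustration.

```python
from collections import deque


def is_connected(num_nodes, edges):
    """Return True if the undirected graph on nodes 0..num_nodes-1 is connected."""
    if num_nodes == 0:
        return True
    adjacency = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    # Breadth-first search from node 0; the graph is connected iff every node is reached.
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == num_nodes


if __name__ == "__main__":
    print(is_connected(3, [(0, 1), (1, 2)]))  # path on three nodes
    print(is_connected(3, [(0, 1)]))          # node 2 is isolated
```

Such a check is useful when experimenting with threshold distances on a specific node drop, since the lemma only guarantees connectivity in the limit of a large network.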
Now, to build the conflict graph, we use the following result, derived in , for approximate information-theoretic optimality of TIN in wireless networks.
Theorem 1 (Theorem 4 in ).
Consider a wireless network with transmitter-receiver pairs , where denotes the signal-to-noise ratio between and , and denotes the interference-to-noise ratio between and . Then, under the following condition,
TIN achieves the entire information-theoretic capacity region of the network (as defined in ) to within a gap of per dimension.
Theorem 1 immediately leads to the following corollary.
Corollary 1.
In a network with transmitter-receiver pairs, if the minimum SNR and the maximum INR across the whole network (denoted by and , respectively) satisfy , then TIN is information-theoretically optimal to within a gap of per dimension.
As mentioned in Section II, the received power at distance can be written as . Hence, given the RGG nature of the communication and conflict graphs, we can bound the SNR and INR values across the network as
Thus, to guarantee the optimality of TIN, while having the sparsest conflict graph, we set the conflict distance as
IV Main Result
In this section, we present our main result on the time needed for exchanging gradients over the communication graph as follows.
Theorem 2.
For a sufficiently large network of learning nodes (), the normalized gradient exchange latency satisfies
Theorem 2 implies that the normalized gradient exchange latency can be upper-bounded in an order-wise fashion (for ) as
Theorem 2 characterizes an achievable normalized gradient exchange latency over the communication graph. Figure 1 demonstrates how this latency changes with and for the case where nodes are dropped within a circular area of radius m, transmit power is assumed to be dBm, noise power spectral density is taken to be dBm/Hz, the bandwidth is MHz, the path-loss exponent is equal to , and the reference channel gain is set to . As demonstrated by (6) and its order-wise approximation in (7), as well as Figure 1, the delay of exchanging gradients over all links in the communication graph monotonically increases with , which is expected, as increasing the network size while keeping the communication graph connected requires an increasing number of gradient exchanges among neighboring nodes.
On the other hand, the latency decreases (approximately) exponentially with . As per (2), determines the threshold distance for gradient exchange among adjacent nodes; increasing will reduce the number of neighbors with which each node exchanges gradients, and this provides a significant saving in terms of the communication latency. Note that this comes at the expense of a slower convergence rate for the global loss function, as it will take longer for each node to obtain access to the gradients from datasets available at farther nodes.
V Achievable Scheme
In this section, we prove our main result in Theorem 2 by providing an achievable scheme for gradient exchange on all links in the communication graph and characterizing an upper bound on its achievable normalized gradient exchange latency.
Given the communication and conflict graphs, the nodes can exchange gradients with their neighbors in the communication graph as long as their exchanges are non-conflicting; i.e., there is no edge between them in the conflict graph. This leads to the notion of independent sets on the conflict graph, where each such independent set contains a set of vertices such that there is no edge between any two of them. This is closely related to the notion of information-theoretic independent sets as defined in  for device-to-device communication networks. It is also analogous to the concept of matchings on the communication topology as considered in , where now the interference between active communication links is also taken into account.
We first start with the following lemma, in which we characterize a lower bound on the symmetric rate within an independent set of the conflict graph, defined as the rate that can be simultaneously achieved by all the corresponding active links in the communication graph.
Lemma 2.
For any independent set in , the symmetric rate is lower-bounded by
Proof.
For every vertex , the achievable rate on the corresponding link from node to node in can be written as
where (9) follows from the fact that link is present in the communication graph, hence their distance satisfies , while the link between nodes and is not present in the conflict graph, implying that . Moreover, (10) follows from the definition of in (5), and from the fact that as , the interference grows larger than noise, i.e., . As all nodes are able to achieve this communication rate, the proof is complete. ∎
Next, we present the following lemma, which provides an upper bound on the chromatic number of the conflict graph.
Lemma 3.
The chromatic number of the conflict graph can be asymptotically upper-bounded by
Proof.
Considering each vertex in the conflict graph, its degree can be upper-bounded as
where is the maximum degree of a random geometric graph with nodes and threshold distance of , and is the maximum degree of , which is a random geometric graph with nodes and threshold distance of . As per equation (4) in , (11) can be upper bounded by
where denotes the clique number of a random geometric graph with nodes and threshold distance , defined as the size of the largest clique in the graph, i.e., the largest subset of vertices in which every two vertices are connected.
Theorem 3 (Theorem 1.2 in ).
For a -dimensional random geometric graph with nodes and threshold distance , if , then its clique number, denoted by , satisfies
where is the unit ball in and is the maximum density of the distribution of nodes in . For Euclidean distance in and uniform distribution of nodes within a circle of radius , and .
Using a greedy coloring algorithm on the conflict graph, its chromatic number can be upper bounded by , where is the maximum degree of the vertices in . Combined with (14), this completes the proof. ∎
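The greedy coloring step used in the proof can be sketched as follows. Each color class is an independent set of the conflict graph, i.e., a set of links that may be activated in the same time slot. The function names and the vertex ordering (by index) are assumptions for illustration; any ordering yields at most one more color than the maximum degree.

```python
def greedy_coloring(num_vertices, edges):
    """Assign each vertex the smallest color not used by its already-colored neighbors."""
    adjacency = [set() for _ in range(num_vertices)]
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    colors = [None] * num_vertices
    for v in range(num_vertices):
        used = {colors[u] for u in adjacency[v] if colors[u] is not None}
        color = 0
        while color in used:
            color += 1
        colors[v] = color
    return colors


def color_classes(colors):
    """Group vertices by color; each group is an independent set."""
    classes = {}
    for vertex, color in enumerate(colors):
        classes.setdefault(color, []).append(vertex)
    return [classes[c] for c in sorted(classes)]


if __name__ == "__main__":
    # A 5-cycle as a toy conflict graph: maximum degree 2, so at most 3 colors.
    cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    coloring = greedy_coloring(5, cycle)
    print(coloring, color_classes(coloring))
```

A vertex with d neighbors sees at most d forbidden colors, so the color it receives is at most d, which is where the bound of maximum degree plus one comes from.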
Having Lemmas 2 and 3, we now proceed to prove Theorem 2. Suppose that we have a proper coloring on the conflict graph with colors, where the independent set corresponding to each color is denoted by . Then, assuming that all independent sets use time-sharing to exchange the gradients, we can bound the normalized gradient exchange latency as
where is defined as
Now, note that is equal to the total number of vertices in the conflict graph, or the edges in the communication graph; i.e.,
Proposition A.1 in  suggests that the average degree of a 2-dimensional random geometric graph with nodes dropped uniformly at random within a circular area of radius and a threshold distance asymptotically converges to . Therefore, we have
which together with (18) leads to
Appendix A Proof of Concavity of in (17) for
Letting , we can write the first derivative of as
which leads to the second derivative of as
We can write the derivative in the numerator of (20) as
Appendix B Proof of Monotonicity of the Bound in (19)
- In this paper, we use the short-hand notation to denote the natural logarithm operation .
- I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
- I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, pp. 3104–3112.
- M. K. Leung, H. Y. Xiong, L. J. Lee, and B. J. Frey, “Deep learning of the tissue-regulated splicing code,” Bioinformatics, vol. 30, no. 12, pp. i121–i129, 2014.
- P. E. Naeini, S. Bhagavatula, H. Habib, M. Degeling, L. Bauer, L. F. Cranor, and N. Sadeh, “Privacy expectations and preferences in an IoT world,” in Thirteenth Symposium on Usable Privacy and Security (SOUPS 2017). Santa Clara, CA: USENIX Association, Jul. 2017, pp. 399–412. [Online]. Available: https://www.usenix.org/conference/soups2017/technical-sessions/presentation/naeini
- J. Poushter et al., “Smartphone ownership and internet usage continues to climb in emerging economies,” Pew Research Center, vol. 22, pp. 1–44, 2016.
- H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., “Communication-efficient learning of deep networks from decentralized data,” arXiv preprint arXiv:1602.05629, 2016.
- M. Kamp, L. Adilova, J. Sicking, F. Hüger, P. Schlicht, T. Wirtz, and S. Wrobel, “Efficient decentralized deep learning by dynamic model averaging,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018, pp. 393–409.
- Y. Zhang, J. C. Duchi, and M. J. Wainwright, “Communication-efficient algorithms for statistical optimization,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 3321–3363, 2013.
- T. Chen, G. Giannakis, T. Sun, and W. Yin, “LAG: Lazily aggregated gradient for communication-efficient distributed learning,” in Advances in Neural Information Processing Systems, 2018, pp. 5050–5060.
- K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee, “Optimal algorithms for non-smooth distributed optimization in networks,” in Advances in Neural Information Processing Systems, 2018, pp. 2740–2749.
- A. Koloskova, S. U. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” arXiv preprint arXiv:1902.00340, 2019.
- J. Wang, A. K. Sahu, Z. Yang, G. Joshi, and S. Kar, “MATCHA: Speeding up decentralized SGD via matching decomposition sampling,” arXiv preprint arXiv:1905.09435, 2019.
- D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations,” arXiv preprint arXiv:1906.02367, 2019.
- A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Robust and communication-efficient collaborative learning,” in Advances in Neural Information Processing Systems, 2019, pp. 8386–8397.
- M. M. Amiri and D. Gunduz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” arXiv preprint arXiv:1901.00844, 2019.
- J.-H. Ahn, O. Simeone, and J. Kang, “Wireless federated distillation for distributed edge learning with heterogeneous data,” in 2019 IEEE 30th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). IEEE, 2019, pp. 1–6.
- Q. Zeng, Y. Du, K. K. Leung, and K. Huang, “Energy-efficient radio resource allocation for federated edge learning,” arXiv preprint arXiv:1907.06040, 2019.
- H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Transactions on Communications, 2019.
- M. M. Amiri, D. Gunduz, S. R. Kulkarni, and H. V. Poor, “Update aware device scheduling for federated learning at the wireless edge,” arXiv preprint arXiv:2001.10402, 2020.
- C. Geng, N. Naderializadeh, A. S. Avestimehr, and S. A. Jafar, “On the optimality of treating interference as noise,” IEEE Transactions on Information Theory, vol. 61, no. 4, pp. 1753–1767, 2015.
- N. Naderializadeh and A. S. Avestimehr, “ITLinQ: A new approach for spectrum sharing in device-to-device communication systems,” IEEE Journal on Selected Areas in Communications, vol. 32, no. 6, pp. 1139–1151, 2014.
- P. Gupta and P. R. Kumar, “Critical power for asymptotic connectivity in wireless networks,” in Stochastic analysis, control, optimization and applications. Springer, 1999, pp. 547–566.
- L. Decreusefond, P. Martins, and A. Vergne, “Clique number of random geometric graphs,” 2013, working paper or preprint. [Online]. Available: https://hal.archives-ouvertes.fr/hal-00864303
- C. McDiarmid and T. Müller, “On the chromatic number of random geometric graphs,” Combinatorica, vol. 31, no. 4, pp. 423–488, 2011.
- T. Müller, “Two-point concentration in random geometric graphs,” Combinatorica, vol. 28, no. 5, p. 529, 2008.