# Federated Learning over Wireless Device-to-Device Networks: Algorithms and Convergence Analysis

## Abstract

The proliferation of Internet-of-Things (IoT) devices and cloud-computing applications over siloed data centers is motivating renewed interest in the collaborative training of a shared model by multiple individual clients via federated learning (FL). To improve the communication efficiency of FL implementations in wireless systems, recent works have proposed compression and dimension reduction mechanisms, along with digital and analog transmission schemes that account for channel noise, fading, and interference. This prior art has mainly focused on star topologies consisting of distributed clients and a central server. In contrast, this paper studies FL over wireless device-to-device (D2D) networks by providing theoretical insights into the performance of digital and analog implementations of decentralized stochastic gradient descent (DSGD). First, we introduce generic digital and analog wireless implementations of communication-efficient DSGD algorithms, leveraging random linear coding (RLC) for compression and over-the-air computation (AirComp) for simultaneous analog transmissions. Next, under the assumptions of convexity and connectivity, we provide convergence bounds for both implementations. The results demonstrate the dependence of the optimality gap on the connectivity and on the signal-to-noise ratio (SNR) levels in the network. The analysis is corroborated by experiments on an image-classification task.

Index Terms: Federated learning, distributed learning, decentralized stochastic gradient descent, over-the-air computation, D2D networks.

## I Introduction

With the proliferation of Internet-of-Things (IoT) devices and cloud-computing applications over siloed data centres, distributed learning has become a critical enabler for artificial intelligence (AI) solutions [6, 8]. In distributed learning, multiple agents collaboratively train a machine learning model via the exchange of training data, model parameters and/or gradient vectors over geographically distributed computing resources and data. Federated learning (FL) refers to distributed learning protocols that do not directly exchange the training data in an attempt to reduce the communication load and to limit privacy concerns [14, 17, 36, 37]. In conventional FL, multiple clients train a shared model by exchanging model-related parameters with a central node. This class of protocols hence relies on a parameter-server architecture, which is typically realized in wireless settings via a base station-centric network topology [3, 7, 1, 28, 5, 4].

There are important scenarios in which a base station-centric architecture is either unavailable or undesirable due to coverage, privacy, implementation efficiency, or fault-tolerance considerations [31, 34]. In such cases, distributed learning must rely on a peer-to-peer, edge-based communication topology that encompasses device-to-device (D2D) links among individual learning agents over an arbitrary connectivity graph. With the exception of [33, 21], all prior works on decentralized FL nevertheless assume either ideal or rate-limited but noiseless D2D communications. This paper is the first to offer a rigorous convergence analysis of digital and analog implementations of wireless D2D FL, with the aim of providing insights into the effect of wireless impairments caused by link blockages, pathloss, channel fading, and interference.

### I-A Related Work

The problem of alleviating the communication load in FL systems has been widely investigated, mostly under the assumption of noiseless, rate-limited links, and star topologies. Key elements of these solutions are compression and dimension-reduction operations that map the original model parameters or gradient vectors into representations defined by a limited number of bits and/or sparsity. Important classes of solutions include unbiased compressors (e.g., [3, 7, 1]) and biased compressors with error-feedback mechanisms (e.g., [28, 5, 4]).

In a D2D architecture, even in the presence of noiseless communication, devices can only exchange information with their respective neighbors, making consensus mechanisms essential to ensure agreement towards the common learning goal [19, 22]. A well-known protocol integrating stochastic gradient descent (SGD) and consensus is Decentralized Stochastic Gradient Descent (DSGD), which has been further extended and improved via gradient tracking [32], as well as via variance-reduction schemes for large data heterogeneity among agents [25]. Similar to FL in star topologies, the communication overhead in decentralized learning can be reduced via compression, as demonstrated by the CHOCO-SGD algorithm [16, 15]. The protocol, combining DSGD with biased compression, was studied for strongly convex and smooth objectives in [16] and for non-convex smooth objectives in [15]. It was further combined with event-triggered protocols in [23].

A large number of recent works have proposed communication strategies and multi-access protocols for FL in wireless star topologies [9, 38, 35, 10, 20]. At the physical layer, over-the-air computation (AirComp) was investigated in [4, 2, 38, 18, 11, 27] as a promising solution to support simultaneous transmissions by leveraging the waveform superposition property of the wireless medium. Unlike conventional digital communication over orthogonal transmission blocks, AirComp is based on analog, i.e., uncoded, transmission, which enables the estimation of aggregated statistics directly from the received baseband samples. This reduces the communication burden, relieving the network from the need to decode individual information separately for all participating devices. The impact of AirComp on the performance of FL was studied in [11, 27]. The authors in [11] proposed an adaptive learning-rate scheduler and investigated the convergence of the resulting protocol. Reference [27] derived sufficient conditions in terms of signal-to-noise ratio (SNR) for the FL algorithm to attain the same convergence rate as in the ideal noiseless case.

The literature on decentralized FL in wireless D2D architectures is, in contrast, still quite limited. A DSGD-based algorithm termed MATCHA was proposed in [26] by accounting for interference among nearby links. Adopting a general interference graph, the scheme was based on sampling a matching decomposition of the graph, whereby the connectivity-critical links are activated with a higher probability. By relying on a matching decomposition, MATCHA schedules noiseless non-interfering communication links in parallel. No attempt was made to leverage non-orthogonal physical layer protocols such as AirComp. In contrast, in the conference version [33] of this paper, wireless protocols for both digital and analog implementations of error-compensated DSGD were designed by including AirComp. However, no theoretical analysis was offered on the convergence of the considered wireless decentralized FL algorithms.

### I-B Main Contributions

In this paper, we provide for the first time a rigorous analysis of digital and analog transmission protocols tailored to FL over wireless D2D networks in terms of their convergence properties. The contributions of this paper are specifically summarized as follows.

1) We introduce generic digital and analog wireless implementations of DSGD algorithms. The protocols rely on compression via random linear coding (RLC) as applied to model differential information. The general protocols enable broadcasting for digital transmission, and both broadcasting and AirComp for analog transmission.

2) Under the assumptions of convexity and connectivity, we derive convergence bounds for the generic digital wireless implementation. The result demonstrates the dependence of the optimality gap on the connectivity of the graph and on the model differential estimation error due to compression.

3) We also provide convergence bounds for the analog wireless implementation. The analysis reveals the impact of topology and channel noise, as well as the importance of implementing an adaptive consensus step size. To the best of our knowledge, this is the first time that an adaptive consensus step size is shown to be beneficial for convergence.

4) We provide numerical experiments for image classification, confirming the benefits of the proposed adaptive consensus rate, and demonstrating the agreement between analytical and empirical results.

The remainder of this paper is organized as follows. The system model is presented in Section II. Digital and analog transmission protocols are introduced in Section III. The convergence analysis for both implementations is presented in Section IV and Section V, respectively. Numerical performance results are described in Section VI, followed by conclusions in Section VII.

### I-C Notations

We use upper case boldface letters for matrices and lower case boldface letters for vectors. We also use $\|\cdot\|$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. The average of vectors $\boldsymbol{x}_i$ over $i\in\mathcal{V}$ is defined as $\bar{\boldsymbol{x}}\triangleq\frac{1}{K}\sum_{i\in\mathcal{V}}\boldsymbol{x}_i$. Notations $\mathrm{tr}(\cdot)$ and $(\cdot)^T$ denote the trace and the transpose of a matrix, respectively. $\mathbb{E}[\cdot]$ stands for the statistical expectation of a random variable. $\boldsymbol{I}$ represents an identity matrix with appropriate size, and $\triangleq$ indicates a mathematical definition. $\lambda_i(\cdot)$ denotes the $i$th largest eigenvalue of a matrix.

## II System Model

In this paper, we consider an FL problem in a decentralized setting as shown in Fig. 1, in which a set $\mathcal{V}$ of $K$ devices can only communicate with their respective neighbors over a wireless D2D network whose connectivity is characterized by an undirected graph $\mathcal{G}(\mathcal{V},\mathcal{E})$, with $\mathcal{V}$ denoting the set of nodes and $\mathcal{E}$ the set of edges. The set of neighbors of node $i$ is denoted as $\mathcal{N}_i$. Following the FL framework, each device has available a local data set, and all devices collaboratively train a machine learning model by exchanging model-related information without directly disclosing data samples to one another.

### II-A Learning Model

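Each device $i\in\mathcal{V}$ has access to its local data set $\mathcal{D}_i$, which may have non-empty intersection with the data set $\mathcal{D}_j$ of any other device $j\in\mathcal{V}$, $j\neq i$. All devices share a common machine learning model class, which is parametrized by a vector $\boldsymbol{\theta}\in\mathbb{R}^{d\times 1}$. As a typical example, the model class may consist of a neural network with a given architecture. The goal of the network is to solve the empirical risk minimization problem [32, 15]

$$\text{(P0)}:\quad \underset{\boldsymbol{\theta}}{\text{Minimize}}\quad F(\boldsymbol{\theta})\triangleq\frac{1}{K}\sum_{i\in\mathcal{V}}f_i(\boldsymbol{\theta}),$$

where $f_i(\boldsymbol{\theta})$ is the local empirical risk function for device $i$, obtained by averaging over the data set $\mathcal{D}_i$ the loss accruing from parameter $\boldsymbol{\theta}$ on each data sample, which may include the effect of regularization.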

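To enable decentralized learning, we adopt CHOCO-SGD, a communication-efficient variant of DSGD [16]. At the start of each iteration $t$, device $i$ has in its memory its current model iterate $\boldsymbol{\theta}_i^{(t)}$, the corresponding estimated version $\hat{\boldsymbol{\theta}}_i^{(t)}$, and the estimated iterates $\hat{\boldsymbol{\theta}}_j^{(t)}$ for all its neighbors $j\in\mathcal{N}_i$. We note that an equivalent version of the algorithm that requires less memory can be found in [16, Algorithm 6], but we do not consider it here since it does not change the communication requirements. Furthermore, at each iteration $t$, device $i$ first executes a local update step by SGD based on its data set $\mathcal{D}_i$ as

$$\boldsymbol{\theta}_i^{(t+1/2)}=\boldsymbol{\theta}_i^{(t)}-\eta^{(t)}\hat{\nabla}f_i(\boldsymbol{\theta}_i^{(t)}), \tag{1}$$

where $\eta^{(t)}$ denotes the learning rate, and $\hat{\nabla}f_i(\boldsymbol{\theta}_i^{(t)})$ is an estimate of the exact gradient $\nabla f_i(\boldsymbol{\theta}_i^{(t)})$ obtained from a mini-batch $\mathcal{D}_i^{(t)}\subseteq\mathcal{D}_i$ of the data set.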
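As a minimal sketch of the local update (1), the following runs mini-batch SGD at a single device for a scalar model. The squared loss, the toy data, and the names `minibatch_gradient` and `local_sgd_step` are illustrative assumptions and are not taken from the paper.

```python
import random

def minibatch_gradient(theta, batch):
    """Average gradient of the squared loss 0.5 * (theta * x - y)^2 over a mini-batch."""
    return sum((theta * x - y) * x for x, y in batch) / len(batch)

def local_sgd_step(theta, data, eta, batch_size, rng):
    """Local update (1): theta^(t+1/2) = theta^(t) - eta * mini-batch gradient estimate."""
    batch = rng.sample(data, batch_size)
    return theta - eta * minibatch_gradient(theta, batch)

rng = random.Random(0)
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # noiseless samples of y = 2x
theta = 0.0
for _ in range(200):
    theta = local_sgd_step(theta, data, eta=0.1, batch_size=2, rng=rng)
```

With noiseless data the iterate contracts towards the minimizer 2 at every step; in the decentralized protocol, this step is followed by the compression and consensus steps described next.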

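Then, each device $i$ compresses the difference $\boldsymbol{\theta}_i^{(t+1/2)}-\hat{\boldsymbol{\theta}}_i^{(t)}$ between the locally updated model (1) and the previously estimated iterate $\hat{\boldsymbol{\theta}}_i^{(t)}$. The compressed difference is then exchanged with the neighbors of node $i$. Assuming that communication is reliable, an assumption that we will revisit in the rest of the paper, each device $i$ updates the estimated model parameters for itself and for its neighbors as

$$\hat{\boldsymbol{\theta}}_j^{(t+1)}=\hat{\boldsymbol{\theta}}_j^{(t)}+\mathcal{D}^{(t)}\left(\mathcal{C}^{(t)}\left(\boldsymbol{\theta}_j^{(t+1/2)}-\hat{\boldsymbol{\theta}}_j^{(t)}\right)\right),\quad j\in\{i\}\cup\mathcal{N}_i, \tag{2}$$

where $\mathcal{C}^{(t)}(\cdot)$ is a compression function and $\mathcal{D}^{(t)}(\cdot)$ is a decoding function. Next, device $i$ executes a consensus update step by correcting the updated model (1) using the estimated parameters (2) as

$$\boldsymbol{\theta}_i^{(t+1)}=\boldsymbol{\theta}_i^{(t+1/2)}+\zeta^{(t)}\sum_{j\in\mathcal{N}_i\cup\{i\}}w_{ij}\left(\hat{\boldsymbol{\theta}}_j^{(t+1)}-\hat{\boldsymbol{\theta}}_i^{(t+1)}\right), \tag{3}$$

where $\zeta^{(t)}$ is the consensus rate, and the mixing matrix $\boldsymbol{W}=[w_{ij}]$ is selected to be doubly stochastic, i.e., $w_{ij}\ge 0$, $\boldsymbol{W}\boldsymbol{1}=\boldsymbol{1}$, and $\boldsymbol{1}^T\boldsymbol{W}=\boldsymbol{1}^T$. A typical choice is to assign a common weight to all edges $(i,j)\in\mathcal{E}$, to set each self-weight $w_{ii}$ so that the $i$th row sums to one, and to set $w_{ij}=0$ otherwise, where the common edge weight is a design parameter. We postpone discussion regarding the compression operator $\mathcal{C}^{(t)}(\cdot)$ and the decoding operator $\mathcal{D}^{(t)}(\cdot)$ to Section II-C. The considered decentralized learning protocol is summarized in Algorithm 1.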
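The weight choice and the consensus update (3) can be sketched as follows for scalar models. The 4-node ring graph, the edge weight `alpha`, and the assumption of perfect estimates (each estimated iterate equals the locally updated model) are illustrative choices made here, not the paper's experimental setup.

```python
def mixing_matrix(num_nodes, edges, alpha):
    """Common weight alpha on every edge; self-weights make each row sum to one."""
    W = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        W[i][j] = alpha
        W[j][i] = alpha
    for i in range(num_nodes):
        W[i][i] = 1.0 - sum(W[i])
    return W

def consensus_step(theta_half, theta_hat, W, zeta):
    """Update (3): theta_i <- theta_half_i + zeta * sum_j w_ij * (hat_j - hat_i)."""
    K = len(theta_half)
    return [
        theta_half[i]
        + zeta * sum(W[i][j] * (theta_hat[j] - theta_hat[i]) for j in range(K))
        for i in range(K)
    ]

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # hypothetical 4-node ring
W = mixing_matrix(4, edges, alpha=0.25)
theta_half = [1.0, 2.0, 3.0, 6.0]
# Perfect estimates are assumed here, so theta_hat equals theta_half.
theta_next = consensus_step(theta_half, theta_half, W, zeta=1.0)
```

Because the matrix is symmetric with unit row sums, the step preserves the network-wide average of the models while shrinking their spread, which is the role played by consensus in DSGD.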

Finally, we make the following assumptions that are widely adopted in the literature on decentralized stochastic optimization [16].

###### Assumption II.1

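Each local empirical risk function $f_i$, $i\in\mathcal{V}$, is $L$-smooth and $\mu$-strongly convex, that is, for all $\boldsymbol{\theta}_1\in\mathbb{R}^{d\times 1}$ and $\boldsymbol{\theta}_2\in\mathbb{R}^{d\times 1}$, it satisfies the inequalities

$$f_i(\boldsymbol{\theta}_1)\le f_i(\boldsymbol{\theta}_2)+\nabla f_i(\boldsymbol{\theta}_2)^T(\boldsymbol{\theta}_1-\boldsymbol{\theta}_2)+\frac{L}{2}\left\|\boldsymbol{\theta}_1-\boldsymbol{\theta}_2\right\|^2, \tag{4}$$

and

$$f_i(\boldsymbol{\theta}_1)\ge f_i(\boldsymbol{\theta}_2)+\nabla f_i(\boldsymbol{\theta}_2)^T(\boldsymbol{\theta}_1-\boldsymbol{\theta}_2)+\frac{\mu}{2}\left\|\boldsymbol{\theta}_1-\boldsymbol{\theta}_2\right\|^2. \tag{5}$$

###### Assumption II.2

The variance of the mini-batch gradient $\hat{\nabla}f_i(\boldsymbol{\theta}_i)$ is bounded as

$$\mathbb{E}_{\mathcal{D}_i^{(t)}}\left[\left\|\hat{\nabla}f_i(\boldsymbol{\theta}_i)-\nabla f_i(\boldsymbol{\theta}_i)\right\|^2\right]\le\sigma_i^2, \tag{6}$$

and its expected squared Euclidean norm is bounded as

$$\mathbb{E}_{\mathcal{D}_i^{(t)}}\left[\left\|\hat{\nabla}f_i(\boldsymbol{\theta}_i)\right\|^2\right]\le G^2, \tag{7}$$

where the expectation is taken over the selection of a mini-batch $\mathcal{D}_i^{(t)}\subseteq\mathcal{D}_i$.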

### II-B Communication Model

As seen in Fig. 2, at the end of every iteration $t$, communication takes place within one communication block of a total number $N$ of channel uses spanning $M$ equal-length slots, indexed by $s$. Slow fading remains constant across all iterations, and is binary, determining whether a link is blocked or not. A link $(i,j)\in\mathcal{E}$ is by definition not blocked, while all the other links are blocked. We assume that the connectivity graph $\mathcal{G}$ with all the unblocked links as edges satisfies the following assumption.

###### Assumption II.3

Graph $\mathcal{G}$ is a connected graph.

For all unblocked links $(i,j)\in\mathcal{E}$, the channel coefficient between devices $i$ and $j$ is modelled as

$$h_{ij}^{\prime(t)}\triangleq\sqrt{A_0}\left(\frac{d_{ij}}{d_0}\right)^{-\frac{\gamma}{2}}h_{ij}^{(t)}, \tag{8}$$

where the fast fading coefficient $h_{ij}^{(t)}$ remains unchanged within one communication block and varies independently across blocks, and the path loss gain $A_0(d_{ij}/d_0)^{-\gamma}$ is constant across all iterations. Here, $A_0$ is the average channel power gain at reference distance $d_0$; $d_{ij}$ is the distance between devices $i$ and $j$; and $\gamma$ is the path loss exponent.
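A minimal sketch of the channel model (8) is given below. The unit-power complex Gaussian (Rayleigh) law used in `rayleigh_sample` is an assumption made here for illustration, as are all numerical values and function names.

```python
import math
import random

def channel_coefficient(h_fast, distance, a0, d0, gamma):
    """Equation (8): sqrt(A0) * (d_ij / d0)^(-gamma / 2) * h_ij."""
    return math.sqrt(a0) * (distance / d0) ** (-gamma / 2.0) * h_fast

def rayleigh_sample(rng):
    """Unit-power complex Gaussian fast-fading draw (assumed fading law)."""
    std = math.sqrt(0.5)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))

rng = random.Random(0)
# One independent fast-fading draw per communication block; path loss stays fixed.
gains = [
    abs(channel_coefficient(rayleigh_sample(rng), distance=10.0,
                            a0=1e-3, d0=1.0, gamma=3.0)) ** 2
    for _ in range(5)
]
path_loss_only = abs(channel_coefficient(1 + 0j, 10.0, 1e-3, 1.0, 3.0)) ** 2
```

With a unit fast-fading coefficient, the power gain reduces to the path loss gain, which stays constant across iterations, while the draws in `gains` fluctuate from block to block.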

Each device is subject to an energy constraint of $NP^{(t)}$ per communication block. If a device is active for $M_i$ of the $M$ slots, the energy per symbol is hence given by $MP^{(t)}/M_i$. The mean-square power of the additive white Gaussian noise (AWGN) is denoted as $N_0$.

### II-C Compression

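In this subsection, we describe the assumed compression operator $\mathcal{C}^{(t)}(\cdot)$ and decompression operator $\mathcal{D}^{(t)}(\cdot)$ that are used in the update (2). We specifically adopt random linear coding (RLC) compression [1]. Let $\boldsymbol{A}^{(t)}\in\mathbb{R}^{m\times d}$, with $m\le d$, be the linear encoding matrix, which is constructed from a partial Hadamard matrix with mutually orthogonal rows and from a diagonal matrix whose diagonal entries are drawn from uniform distributions. The compression operator is given by the linear projection $\mathcal{C}^{(t)}(\boldsymbol{u})=\boldsymbol{A}^{(t)}\boldsymbol{u}$, while decoding takes place as $\mathcal{D}^{(t)}(\boldsymbol{v})=\frac{m}{d}(\boldsymbol{A}^{(t)})^T\boldsymbol{v}$. The concatenation of the compression and decompression operators, namely, $\mathcal{D}^{(t)}(\mathcal{C}^{(t)}(\cdot))$, satisfies the compression property [16, 1]

$$\mathbb{E}\left\|\boldsymbol{u}-\frac{m}{d}(\boldsymbol{A}^{(t)})^T\boldsymbol{A}^{(t)}\boldsymbol{u}\right\|^2=\left(1-\frac{m}{d}\right)\|\boldsymbol{u}\|^2,\quad\text{for all }\boldsymbol{u}\in\mathbb{R}^{d\times 1}. \tag{9}$$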

We note that the random matrices need to be shared among devices prior to the start of the communication protocol such that the same random sequence is agreed upon by all devices.
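The compression property (9) can be checked numerically with the sketch below. The specific construction is an assumption made for illustration: rows are selected uniformly at random from a Sylvester Hadamard matrix, the diagonal entries are random signs, and the matrix is scaled by $1/\sqrt{m}$ so that $\frac{m}{d}\boldsymbol{A}^T\boldsymbol{A}$ is an orthogonal projection. The paper's exact normalization may differ.

```python
import math
import random

def hadamard(d):
    """Sylvester construction of a d x d Hadamard matrix (d a power of two)."""
    H = [[1]]
    while len(H) < d:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def rlc_matrix(d, rows, signs):
    """A = (1/sqrt(m)) * H[rows] * diag(signs), so that A A^T = (d/m) I."""
    H = hadamard(d)
    m = len(rows)
    return [[H[r][k] * signs[k] / math.sqrt(m) for k in range(d)] for r in rows]

def compress(A, u):
    """Linear projection A u."""
    return [sum(a * x for a, x in zip(row, u)) for row in A]

def decompress(A, v):
    """Decoding (m/d) * A^T v."""
    m, d = len(A), len(A[0])
    return [(m / d) * sum(A[r][k] * v[r] for r in range(m)) for k in range(d)]

def sq_error(A, u):
    """Squared reconstruction error of decompress(compress(u))."""
    u_hat = decompress(A, compress(A, u))
    return sum((x - y) ** 2 for x, y in zip(u, u_hat))

rng = random.Random(1)
d, m = 8, 3
signs = [rng.choice((-1, 1)) for _ in range(d)]
rows = rng.sample(range(d), m)  # shared pseudo-random row selection
A = rlc_matrix(d, rows, signs)
u = [rng.gauss(0.0, 1.0) for _ in range(d)]
err = sq_error(A, u)
```

Under this construction, averaging `sq_error` over all row subsets of size $m$ gives exactly $(1-m/d)\|\boldsymbol{u}\|^2$, and the error vanishes when $m=d$.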

## III Digital and Analog Transmission Protocols

In this section, we describe digital and analog wireless implementations of the decentralized learning algorithm reviewed in the previous section. The implementations are meant to serve as prototypical templates for the deployment of decentralized learning. In practice, specific protocols are needed to specify the scheduling strategy used to allocate the slots in Fig. 2 to devices. The analysis in this paper, to be detailed in Section IV and Section V, applies to any scheduling algorithm.

### III-A Digital Transmission

In the digital transmission protocol, devices represent their model updates as digital messages for transmission. The number of bits $B_i^{(t)}$ that device $i$ can successfully broadcast to its neighbors during a slot allocated to it by the scheduling algorithm is limited by the neighboring device with the worst channel power gain. Accordingly, we have

$$B_i^{(t)}=\frac{N}{M}\log_2\left(1+\frac{P^{(t)}}{N_0}M\min_{j\in\mathcal{N}_i}\left|h_{ij}^{\prime(t)}\right|^2\right). \tag{10}$$

We recall that, in (10), the number of time slots $M$ per iteration is decided by the scheduling scheme.

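To quantize the encoded vector $\boldsymbol{A}_i^{(t)}(\boldsymbol{\theta}_i^{(t+1/2)}-\hat{\boldsymbol{\theta}}_i^{(t)})\in\mathbb{R}^{m_i^{(t)}\times 1}$ into $B_i^{(t)}$ bits, we employ a simple per-element $b$-bit quantizer $Q_b(\cdot)$ with chip-level precision, so that $b=64$ or $b=32$ for double-precision or single-precision floating-point, respectively, according to the IEEE 754 standard. Communication constraints thus impose the inequality $b\,m_i^{(t)}\le B_i^{(t)}$, which is satisfied by setting $m_i^{(t)}=\min\{\lfloor B_i^{(t)}/b\rfloor,d\}$. Based on the (received) quantized signal, each device $i$ updates the estimated model parameters of its own as well as of its neighbors in $\mathcal{N}_i$ as (cf. (2))

$$\hat{\boldsymbol{\theta}}_j^{(t+1)}=\hat{\boldsymbol{\theta}}_j^{(t)}+\frac{m_j^{(t)}}{d}(\boldsymbol{A}_j^{(t)})^TQ_b\left(\boldsymbol{A}_j^{(t)}\left(\boldsymbol{\theta}_j^{(t+1/2)}-\hat{\boldsymbol{\theta}}_j^{(t)}\right)\right). \tag{11}$$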

In order to implement update (11), each node $i$ and its neighbors in set $\mathcal{N}_i$ can share a priori a common sequence of (pseudo-)random matrices $\boldsymbol{A}^{(t)}$ satisfying the assumption described in Section II-C. If node $i$ sends its current value $m_i^{(t)}$ to all its neighbors, and if $m_i^{(t)}$ does not exceed the number of rows of the shared matrix, all nodes can thus select the same submatrix $\boldsymbol{A}_i^{(t)}$ from $\boldsymbol{A}^{(t)}$ to evaluate (11). The described digital implementation is summarized in Algorithm 2 in the Appendices.
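The rate expression (10) and the resulting budget on the number of transmitted entries can be sketched as follows. The function names, the numerical values, and the cap of the row count at the model dimension are assumptions made for illustration.

```python
import math

def broadcast_bits(n_uses, n_slots, power, noise_power, channel_gains):
    """Rate expression (10): bits deliverable in one slot to every neighbor."""
    worst_gain = min(channel_gains)  # weakest |h'_ij|^2 over the neighbors
    snr = power / noise_power * n_slots * worst_gain
    return n_uses / n_slots * math.log2(1.0 + snr)

def rows_supported(bits, bits_per_entry, d):
    """Largest number of b-bit entries that fit in the bit budget, capped at d."""
    return min(int(bits // bits_per_entry), d)

bits = broadcast_bits(n_uses=1000, n_slots=4, power=3.0, noise_power=4.0,
                      channel_gains=[1.0, 2.5, 1.7])
m_rows = rows_supported(bits, bits_per_entry=32, d=64)
```

In this toy setting the worst neighbor sees an SNR of 3, so each of the 250 channel uses in the slot carries 2 bits, and 15 single-precision entries fit in the resulting 500-bit budget.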

### III-B Analog Transmission

With analog transmission, devices directly transmit their respective updated parameters by mapping analog signals to channel uses, without the need for digitization. As studied in [38, 18], in addition to broadcast as in digital transmission, it is also useful to schedule all devices that share a common neighbor for simultaneous transmission in order to enable AirComp. The considered class of protocols can accommodate any scheduling scheme that, as in [33], operates over pairs of consecutive time slots in order to leverage AirComp. In the first slot of each pair, one or more center nodes receive a superposition of the signals simultaneously transmitted by all their respective neighbors. In the second slot, the center nodes serve as broadcast transmitters communicating to all their neighbors. The total number of time slots is thus twice the number of pairs of time slots, which is specified by the scheduling policy in use.

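To elaborate on the operation of the analog transmission protocol, we will use the following notation. For each device $i$, we define a set $\mathcal{S}_i^{\text{Tx}}$ of transmission slots, with $\mathcal{S}_i^{\text{Tx}}$ partitioned into disjoint subsets $\mathcal{S}_i^{\text{AT}}$ and $\mathcal{S}_i^{\text{BT}}$ ($\mathcal{S}_i^{\text{Tx}}=\mathcal{S}_i^{\text{AT}}\cup\mathcal{S}_i^{\text{BT}}$). Subset $\mathcal{S}_i^{\text{BT}}$ denotes the set of transmission slots in which device $i$ broadcasts to its neighbors, and $\mathcal{S}_i^{\text{AT}}$ denotes the set of slots in which device $i$ transmits to enable AirComp. Similarly, we define the set of receiving slots for device $i$ as $\mathcal{S}_i^{\text{Rx}}$ with $\mathcal{S}_i^{\text{Rx}}=\mathcal{S}_i^{\text{AR}}\cup\mathcal{S}_i^{\text{BR}}$, where $\mathcal{S}_i^{\text{BR}}$ and $\mathcal{S}_i^{\text{AR}}$ denote the sets of receiving slots in which device $i$ receives from a transmitter in broadcast and AirComp modes, respectively. Fig. 3 serves as an example illustrating the above definitions.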

We now describe the transmitted and the received signals in each pair of slots of the communication protocol.

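Odd slots: All devices $i$ operating in AirComp mode for a center node $j$ in an odd slot $s$ concurrently transmit the signals

$$\boldsymbol{x}_{ij}^{(t,s)}=\frac{\sqrt{\gamma_j^{(t,s)}}}{h_{ij}^{\prime(t)}}w_{ji}\boldsymbol{A}^{(t)}\left(\boldsymbol{\theta}_i^{(t+1/2)}-\hat{\boldsymbol{\theta}}_i^{(t)}\right), \tag{12}$$

where $\gamma_j^{(t,s)}$ is a power scaling factor for channel alignment at device $j$. The receiving center node, device $j$, obtains

$$\boldsymbol{y}_j^{(t,s)}=\sqrt{\gamma_j^{(t,s)}}\sum_{i\in\mathcal{N}_j^{(s)}}w_{ji}\boldsymbol{A}^{(t)}\left(\boldsymbol{\theta}_i^{(t+1/2)}-\hat{\boldsymbol{\theta}}_i^{(t)}\right)+\boldsymbol{n}_j^{(t,s)}, \tag{13}$$

where $\mathcal{N}_j^{(s)}\subseteq\mathcal{N}_j$ is the set of neighbors of device $j$ operating in AirComp mode at slot $s$, and $\boldsymbol{n}_j^{(t,s)}$ is the received AWGN at slot $s$ of iteration $t$. Device $j$ estimates the combined model parameters via the linear estimator

$$\hat{\boldsymbol{y}}_j^{(t,s)}=\frac{m}{d}(\boldsymbol{A}^{(t)})^T\Re\left\{\boldsymbol{y}_j^{(t,s)}\Big/\sqrt{\gamma_j^{(t,s)}}\right\}. \tag{14}$$

Even slots: Any device $i$ operating in broadcast mode in an even slot $s$ transmits a signal

$$\boldsymbol{x}_i^{(t,s)}=\sqrt{\alpha_i^{(t,s)}}\boldsymbol{A}^{(t)}\left(\boldsymbol{\theta}_i^{(t+1/2)}-\hat{\boldsymbol{\theta}}_i^{(t)}\right), \tag{15}$$

where $\alpha_i^{(t,s)}$ is device $i$'s transmitting power scaling factor in slot $s$ of iteration $t$. Each neighboring device $j\in\mathcal{N}_i$, with $s\in\mathcal{S}_j^{\text{BR}}$, receives from device $i$ the signal

$$\boldsymbol{y}_{ij}^{(t,s)}=\sqrt{\alpha_i^{(t,s)}}h_{ij}^{\prime(t)}\boldsymbol{A}^{(t)}\left(\boldsymbol{\theta}_i^{(t+1/2)}-\hat{\boldsymbol{\theta}}_i^{(t)}\right)+\boldsymbol{n}_j^{(t,s)}, \tag{16}$$

where $\boldsymbol{n}_j^{(t,s)}$ is the received AWGN. Device $j$ estimates the signal via the linear estimator

$$\hat{\boldsymbol{y}}_{ij}^{(t,s)}=w_{ji}\frac{m}{d}(\boldsymbol{A}^{(t)})^T\Re\left\{\frac{\boldsymbol{y}_{ij}^{(t,s)}}{\sqrt{\alpha_i^{(t,s)}}h_{ij}^{\prime(t)}}\right\}, \tag{17}$$

where $\Re\{\cdot\}$ denotes the real part of its argument.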
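The odd-slot AirComp exchange (12)-(14) can be sketched as below. For brevity the sketch takes the encoding matrix to be the identity (no compression, so that $m=d$), and it uses arbitrary channel values; both are simplifying assumptions, and the function name `aircomp_slot` is hypothetical.

```python
import math
import random

def aircomp_slot(diffs, weights, channels, gamma, noise_std, rng):
    """Simulate (12)-(14) at one center node, with A = I (no compression)."""
    d = len(diffs[0])
    received = [0j] * d
    for u, w, h in zip(diffs, weights, channels):
        # (12): each neighbor pre-inverts its channel and applies its weight.
        x = [math.sqrt(gamma) / h * w * val for val in u]
        # The wireless medium superimposes the faded transmissions.
        received = [r + h * xv for r, xv in zip(received, x)]
    # (13): additive complex Gaussian noise at the center node.
    noise = [complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
             for _ in range(d)]
    received = [r + n for r, n in zip(received, noise)]
    # (14): rescale and keep the real part.
    return [(r / math.sqrt(gamma)).real for r in received]

rng = random.Random(0)
diffs = [[1.0, 2.0], [3.0, -1.0]]       # model differences of two neighbors
weights = [0.25, 0.25]                  # mixing weights w_ji
channels = [0.3 + 0.4j, -1.2 + 0.5j]    # arbitrary channel coefficients h'_ij
clean = aircomp_slot(diffs, weights, channels, gamma=2.0, noise_std=0.0, rng=rng)
noisy = aircomp_slot(diffs, weights, channels, gamma=2.0, noise_std=0.1, rng=rng)
```

Without noise the center node recovers the weighted sum of the model differences in a single slot. With noise, the error after rescaling has standard deviation `noise_std / sqrt(gamma)` per entry, which is why the power scaling factor matters.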

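Next, device $j$ updates its estimate of the combined model parameters from all neighboring devices in $\mathcal{N}_j$ by aggregating the estimates obtained at all receiving slots in set $\mathcal{S}_j^{\text{Rx}}$ as

$$\hat{\boldsymbol{y}}_j^{(t+1)}=\hat{\boldsymbol{y}}_j^{(t)}+\sum_{s\in\mathcal{S}_j^{\text{AR}}}\hat{\boldsymbol{y}}_j^{(t,s)}+\sum_{s\in\mathcal{S}_j^{\text{BR}}}\hat{\boldsymbol{y}}_{i_sj}^{(t,s)}, \tag{18}$$

where node $i_s$ is the node that transmits in broadcast mode in slot $s$. The initial estimate of the combined model parameters is given by $\hat{\boldsymbol{y}}_j^{(0)}=\boldsymbol{0}$, $j\in\mathcal{V}$.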

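The power scaling parameters $\gamma_j^{(t,s)}$ (cf. (12)) and $\alpha_i^{(t,s)}$ (cf. (15)) need to be properly chosen such that the power consumed by device $i$ per communication block satisfies

$$\sum_{s\in\mathcal{S}_i^{\text{AT}}}\mathbb{E}\left[\left\|\boldsymbol{x}_{ij_s}^{(t,s)}\right\|^2\right]+\sum_{s\in\mathcal{S}_i^{\text{BT}}}\mathbb{E}\left[\left\|\boldsymbol{x}_i^{(t,s)}\right\|^2\right]\le NP^{(t)},\quad\forall i\in\mathcal{V}, \tag{19}$$

where node $j_s$ is the center node connected to node $i$ in slot $s$. Applying a simple equal power policy across different transmission slots of a device for all communication blocks, we have (cf. (19))

$$\mathbb{E}\left[\left\|\boldsymbol{x}_{ij}^{(t,s)}\right\|^2\right]\le NP^{(t)}/|\mathcal{S}_i^{\text{Tx}}|,\quad\forall s\in\mathcal{S}_i^{\text{AT}}, \tag{20}$$

$$\mathbb{E}\left[\left\|\boldsymbol{x}_i^{(t,s)}\right\|^2\right]\le NP^{(t)}/|\mathcal{S}_i^{\text{Tx}}|,\quad\forall s\in\mathcal{S}_i^{\text{BT}}. \tag{21}$$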

In addition, device $j$ needs to update the estimate of its own model parameter as

$$\hat{\boldsymbol{\theta}}_j^{(t+1)}=\hat{\boldsymbol{\theta}}_j^{(t)}+\frac{m}{d}(\boldsymbol{A}^{(t)})^T\boldsymbol{A}^{(t)}\left(\boldsymbol{\theta}_j^{(t+1/2)}-\hat{\boldsymbol{\theta}}_j^{(t)}\right). \tag{22}$$

Finally, device $j$ approximates update (3) as

$$\boldsymbol{\theta}_j^{(t+1)}=\boldsymbol{\theta}_j^{(t+1/2)}+\zeta^{(t)}\left(w_{jj}\hat{\boldsymbol{\theta}}_j^{(t+1)}+\hat{\boldsymbol{y}}_j^{(t+1)}-\hat{\boldsymbol{\theta}}_j^{(t+1)}\right). \tag{23}$$

To sum up, the proposed analog implementation is presented in Algorithm 3 in the Appendices.

## IV Convergence Analysis for Digital Transmission

In this section, we derive convergence properties of the general class of digital transmission protocols presented in Section III-A. The analysis holds for any fixed transmission schedule, which determines the number $M$ of slots. We start by recalling that, at each iteration $t$, update (11) is carried out by device $i$ for all nodes $j\in\{i\}\cup\mathcal{N}_i$. In (11), the concatenation of compression, quantization, and decompression yields the output vector $\frac{m_j^{(t)}}{d}(\boldsymbol{A}_j^{(t)})^TQ_b(\boldsymbol{A}_j^{(t)}\boldsymbol{u})$ for the input vector $\boldsymbol{u}=\boldsymbol{\theta}_j^{(t+1/2)}-\hat{\boldsymbol{\theta}}_j^{(t)}$. The number $m_i^{(t)}$ of rows of matrix $\boldsymbol{A}_i^{(t)}$ at iteration $t$ depends on the current rate (10) supported by the fast fading channels between device $i$ and its neighbors. Taking the randomness of the fading realizations into account, the counterpart of the compression property (9) under digital transmission is given by the following lemma.

###### Lemma IV.1

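On average over RLC and fading channels, the mean-square estimation error for the concatenation of compression, quantization, and decompression under digital transmission satisfies

$$\mathbb{E}\left\|\boldsymbol{u}-\frac{m_i^{(t)}}{d}(\boldsymbol{A}_i^{(t)})^TQ_b\left(\boldsymbol{A}_i^{(t)}\boldsymbol{u}\right)\right\|^2\le\left(1-\omega^{(t)}\right)\|\boldsymbol{u}\|^2, \tag{24}$$

for all $\boldsymbol{u}\in\mathbb{R}^{d\times 1}$ and for all $i\in\mathcal{V}$, where the parameter $\omega^{(t)}$ is expressed in terms of the function $G_i^{(t)}(n)$ defined as

$$G_i^{(t)}(n)=\exp\left(-\frac{N_0}{P^{(t)}}\frac{1}{MA_0}\left(2^{\frac{nbM}{N}}-1\right)\sum_{j\in\mathcal{N}_i}\left(\frac{d_{ij}}{d_0}\right)^{\gamma}\right). \tag{25}$$

###### Proof:

Please refer to Appendix C.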

By (24), the parameter $\omega^{(t)}$ is a measure of the quality of the reconstruction of the model difference used in update (11). Supposing static channel conditions in which the transmission rate (10) remains constant over iterations, the number of rows of the encoding matrix is also constant, say $m$, across devices and iterations. In these conditions, the right-hand side (RHS) of (24) reduces to $(1-\frac{m}{d})\|\boldsymbol{u}\|^2$ (see Appendix C), which is exactly the RHS of (9). Note that $\omega^{(t)}$ is increasing with the transmission power $P^{(t)}$. In particular, when the transmission power $P^{(t)}\to\infty$, it is seen in (25) that $G_i^{(t)}(n)\to 1$, thus leading to zero mean-square estimation error ($\omega^{(t)}\to 1$).
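The behavior of the function in (25) can be explored with the short sketch below. All numerical values are illustrative assumptions. Under unit-power Rayleigh fast fading, which is an assumption made here and is consistent with the exponential form of (25), the value can be read as the probability that the rate (10) supports at least $n$ quantized entries of $b$ bits each.

```python
import math

def g_function(n, b, n_uses, n_slots, power, noise_power, a0, d0, gamma, distances):
    """Equation (25) for one device; `distances` lists d_ij over its neighbors."""
    threshold = (2.0 ** (n * b * n_slots / n_uses) - 1.0) / n_slots
    path_sum = sum((dist / d0) ** gamma for dist in distances)
    return math.exp(-noise_power / power * threshold / a0 * path_sum)

setup = dict(b=32, n_uses=10000, n_slots=4, noise_power=1.0,
             a0=1e-3, d0=1.0, gamma=3.0, distances=[5.0, 8.0])
low_power = [g_function(n, power=1e5, **setup) for n in (0, 5, 10)]
high_power = [g_function(n, power=1e9, **setup) for n in (0, 5, 10)]
```

The values decrease with the number of entries $n$ and approach one as the transmission power grows, in line with the discussion of the high-power regime above.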

With Lemma IV.1, the convergence properties of the digital protocol can be quantified in a manner similar to [16, Theorem 4]. To this end, we define the following topology-related parameters dependent on the mixing matrix $\boldsymbol{W}$: the spectral gap $\delta$; an additional parameter determined by $\boldsymbol{W}$; and the function $p(\delta,\omega)$ that depends on the spectral gap $\delta$ and on the model-difference estimation quality $\omega$. Then, the convergence of the digital implementation is provided by the following theorem.

###### Theorem IV.1 (Optimality Gap for Digital Transmission [16, Theorem 19])

For learning rate with , and consensus step size , on average over RLC and fading channels , Algorithm 2 yields an optimality gap satisfying

 E[F(~\boldmath{θ}T)]−F∗≤E[1STT−1∑t=0w(t)F(¯% \boldmath{θ}(t))]−F∗≤μ3.25a3−3.25a2STv(0)e+1.625(2a+T)TμST¯σ2Kcentralized error+158.45×24LTμ2(p(δ,ω))2STG2