Federated Learning over Wireless Fading Channels

Mohammad Mohammadi Amiri and Deniz Gündüz

M. Mohammadi Amiri is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA (e-mail: mamiri@princeton.edu). D. Gündüz is with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (e-mail: d.gunduz@imperial.ac.uk).

Abstract

We study federated machine learning at the wireless network edge, where power-limited wireless devices, each with its own dataset, build a joint model with the help of a remote parameter server (PS). We consider a bandwidth-limited fading multiple access channel (MAC) from the wireless devices to the PS, and propose various techniques to implement distributed stochastic gradient descent (DSGD) over this shared noisy wireless channel. We first propose a digital DSGD (D-DSGD) scheme, in which one device is selected opportunistically for transmission at each iteration based on the channel conditions; the scheduled device quantizes its gradient estimate to a finite number of bits, as dictated by the channel condition, and transmits these bits to the PS in a reliable manner. Next, motivated by the additive nature of the wireless MAC, we propose a novel analog communication scheme, referred to as the compressed analog DSGD (CA-DSGD), where the devices first sparsify their gradient estimates while accumulating error from previous iterations, and project the resultant sparse vector into a low-dimensional vector for bandwidth reduction. We also design a power allocation scheme to align the received gradient vectors at the PS in an efficient manner. Numerical results show that D-DSGD outperforms other digital approaches in the literature; however, in general, the proposed CA-DSGD algorithm converges faster than the D-DSGD scheme and other schemes in the literature, and reaches a higher level of accuracy. We have observed that the gap between the analog and digital schemes increases when the datasets of the devices are not independent and identically distributed (i.i.d.). Furthermore, the performance of the CA-DSGD scheme is shown to be robust against imperfect channel state information (CSI) at the devices. Overall, these results show clear advantages for the proposed analog over-the-air DSGD scheme, which suggests that learning and communication algorithms should be designed jointly to achieve the best end-to-end performance in machine learning applications at the wireless edge.



I Introduction

As the dataset sizes and model complexities grow, distributed machine learning (ML) is becoming the only viable alternative to centralized ML. In particular, with the increasing amount of information collected through wireless edge devices, such centralized solutions are becoming increasingly costly, due to the limited power and bandwidth available, and less desirable due to privacy concerns. Federated learning (FL) has been proposed as an alternative privacy-preserving distributed ML scheme, where each device participates in training using only locally available data, with the help of a parameter server (PS) [11]. In FL, devices exchange model parameters and their local updates with the PS, but the data never leaves the devices. In addition to its privacy benefits, this is an attractive approach for wireless edge devices when dataset sizes are very large.

ML problems often involve the minimization of the empirical loss function

$F(\boldsymbol{\theta}) = \frac{1}{|\mathcal{B}|} \sum_{\boldsymbol{u} \in \mathcal{B}} f(\boldsymbol{\theta}, \boldsymbol{u}),$   (1)

where $\boldsymbol{\theta} \in \mathbb{R}^d$ denotes the model parameters to be optimized, $\mathcal{B}$ is the training dataset of size $|\mathcal{B}|$ consisting of data samples and their labels, and $f$ is the loss function defined by the learning task. The minimization of $F$ is typically carried out through the iterative stochastic gradient descent (SGD) algorithm, in which the model parameter vector at iteration $t$, $\boldsymbol{\theta}(t)$, is updated with a stochastic gradient

$\boldsymbol{\theta}(t+1) = \boldsymbol{\theta}(t) - \eta(t)\, \boldsymbol{g}(\boldsymbol{\theta}(t)),$   (2)

which satisfies $\mathbb{E}\big[\boldsymbol{g}(\boldsymbol{\theta}(t))\big] = \nabla F(\boldsymbol{\theta}(t))$, where $\eta(t)$ is the learning rate. SGD can easily be implemented across multiple devices, each of which has access to only a small fraction of the dataset. In distributed SGD (DSGD), at each iteration, device $m$ computes a gradient vector based on the global parameter vector with respect to its local dataset, denoted by $\boldsymbol{g}_m(\boldsymbol{\theta}(t))$, and sends the result to the PS, which updates the global parameter vector according to

$\boldsymbol{\theta}(t+1) = \boldsymbol{\theta}(t) - \eta(t) \cdot \frac{1}{M} \sum_{m=1}^{M} \boldsymbol{g}_m(\boldsymbol{\theta}(t)),$   (3)

where $M$ denotes the number of wireless devices, and $\boldsymbol{g}_m(\boldsymbol{\theta}(t)) = \frac{1}{|\mathcal{B}_m|} \sum_{\boldsymbol{u} \in \mathcal{B}_m} \nabla f(\boldsymbol{\theta}(t), \boldsymbol{u})$, $m \in [M]$. In FL, each device participating in the training can also carry out multiple model updates as in (3) locally, and share the overall difference with respect to the previous global model parameters with the PS [11].
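To make the update rule in (3) concrete, the following minimal NumPy sketch simulates one DSGD iteration with $M$ devices under ideal (noiseless) communication; `grad_fn`, the dataset partition, and the learning rate are placeholders of ours, not objects defined in the paper.

```python
import numpy as np

def dsgd_step(theta, local_datasets, grad_fn, lr):
    """One ideal DSGD iteration: every device computes a stochastic gradient
    at the current model, and the PS averages them and updates, as in (3)."""
    local_grads = [grad_fn(theta, data) for data in local_datasets]  # device side
    avg_grad = np.mean(local_grads, axis=0)                          # PS side
    return theta - lr * avg_grad                                     # model update
```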

What distinguishes FL from conventional ML is the large number of devices that participate in the training, and the low-capacity and unreliable links that connect these devices to the PS. Therefore, there have been significant research efforts to reduce the communication requirements in FL [11, 18, 12, 19, 23, 8, 22, 3, 32, 35, 31, 5, 14, 25, 1, 16, 26, 7, 21, 2, 30, 24, 15, 28]. However, these and follow-up studies ignore the physical-layer aspects of wireless connections and assume interference- and error-free links from the participating devices to the PS, even though FL has been mainly motivated by mobile devices.

In this paper, we consider DSGD over-the-air; that is, we consider a wireless shared medium from the devices to the PS over which they send their gradient estimates. To emphasize the limitations of the wireless medium, we note that the dimension of some of the recent ML models, which also determines the size of the gradient estimates or model updates that must be transmitted to the PS at each iteration, can be extremely large; e.g., the 50-layer ResNet network has over 25 million weight parameters, while the VGGNet architecture has approximately 138 million parameters. On the other hand, the available channel bandwidth is typically small due to bandwidth and latency limitations; for example, one LTE frame of 5 MHz bandwidth and 10 ms duration can carry only 6000 complex symbols. In principle, we can treat each iteration of the DSGD algorithm as a distributed over-the-air lossy computation problem. FL over a static Gaussian MAC is studied in [17], where both a digital scheme, which separates computation and communication, and an analog over-the-air computation scheme are introduced. While the digital scheme exploits gradient quantization followed by independent channel coding at the participating wireless devices, the analog scheme exploits the additive nature of the wireless channel and gradient sparsification, and employs random linear projection for dimensionality reduction. In [36] the authors consider a fading MAC, and also apply analog transmission, where each entry of a gradient vector at each of the devices is scheduled for transmission depending on the corresponding channel condition. A multi-antenna PS is considered in [34], where receive beamforming is used to maximize the number of devices scheduled for transmission at each iteration.

Here, we extend our previous works [17, 4], and study DSGD over a wireless fading MAC. While we consider gradient descent, where each device sends its local gradient estimate at each iteration, the results can easily be extended by letting the devices send their model updates after several local SGD iterations. We first consider the separate computation and communication approach, and propose a digital DSGD (D-DSGD) scheme, in which only a single device is opportunistically scheduled for transmission at each iteration of DSGD based on the channel conditions from the devices to the PS. The scheduled device quantizes its gradient estimate to a finite number of bits using the gradient compression scheme in [7] while accumulating the error from previous iterations (this will be clarified later), and employs a channel code to transmit the bits over the available bandwidth-limited channel to the PS. For the MNIST classification task, it is shown that the proposed digital approach D-DSGD outperforms digital schemes that employ QSGD [3] or SignSGD [5] for gradient compression. We also observe that the proposed opportunistic scheduling scheme outperforms the scheme when all the devices participate in the transmission, with each device allocated orthogonal channel resources to communicate with the PS.

We then study analog transmission from the devices to the PS, motivated by the signal-superposition property of the wireless MAC. First, we extend the scheme in [36] by introducing error accumulation, which is shown to improve the performance. We then propose a novel scheme, inspired by the random projection used in [17] for dimensionality reduction, which we will refer to as the compressed analog DSGD (CA-DSGD). With CA-DSGD, we exploit the similarity in the sparsity patterns of the gradient estimates at different devices to speed up the computations, where each device projects its gradient estimate to a low-dimensional vector and transmits only the important gradient entries while accumulating the error. The CA-DSGD scheme provides the flexibility of adjusting the dimension of the gradient estimate sent by each device, which is particularly important for bandwidth-limited wireless channels, where the bandwidth available for transmission may not be sufficient to send the entire gradient vector in a single time slot. A power allocation scheme is also designed, which aligns the vectors sent by different devices at the PS while satisfying the average power constraint. Numerical results for the MNIST classification task show that the proposed CA-DSGD scheme improves upon the other analog and digital schemes under consideration with the same average power constraint and bandwidth resources, with the improvement being more significant when the datasets across devices are not independent and identically distributed (i.i.d.). Its performance is also shown to be robust against imperfect channel state information (CSI) at the devices, whereas digital schemes are sensitive to CSI accuracy at the devices, particularly if operation close to capacity is desired. In addition to these benefits of the proposed CA-DSGD scheme, we make the following observations:

  1. The improvement of analog over-the-air computation compared to the D-DSGD scheme is particularly striking in the low power regime. This is mainly due to the “beamforming” effect of simultaneously transmitting highly correlated gradient estimates.

  2. While both the convergence speed and the accuracy of the D-DSGD scheme increase significantly with the available average power, the performance of the analog schemes improves only marginally. This highlights the energy efficiency of over-the-air computation, and makes it particularly attractive for FL across low-power IoT sensors.

  3. Increasing the number of devices improves the accuracy for all the schemes even if the total dataset size and total power consumption remain the same. This “diversity gain” is much more limited for the analog scheme, and diminishes further as the training duration increases.

  4. We observe that the performance of the CA-DSGD scheme improves if we reduce the bandwidth used at each iteration, and increase the number of DSGD iterations instead.

Notations: $\mathbb{R}$ and $\mathbb{C}$ represent the sets of real and complex values, respectively. For vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ with the same dimension, $\boldsymbol{x} \circ \boldsymbol{y}$ returns their Hadamard/entry-wise product. For a vector $\boldsymbol{x}$, $\mathrm{Re}\{\boldsymbol{x}\}$ and $\mathrm{Im}\{\boldsymbol{x}\}$ return the entry-wise real and imaginary components of $\boldsymbol{x}$, respectively. Also, $[\boldsymbol{x}, \boldsymbol{y}]$ represents the concatenation of two row vectors $\boldsymbol{x}$ and $\boldsymbol{y}$. We denote a zero-mean normal distribution with variance $\sigma^2$ by $\mathcal{N}(0, \sigma^2)$, and $\mathcal{CN}(0, \sigma^2)$ represents a complex normal distribution with independent real and imaginary terms each distributed according to $\mathcal{N}(0, \sigma^2/2)$. For a positive integer $i$, we let $[i] \triangleq \{1, \ldots, i\}$. We denote the cardinality of set $\mathcal{A}$ by $|\mathcal{A}|$, and the $\ell_2$ norm of vector $\boldsymbol{x}$ by $\|\boldsymbol{x}\|_2$. The imaginary unit is represented by $\mathfrak{j}$.

Fig. 1: Illustration of wireless FL architecture. The PS sends the updated parameter vector to all the wireless devices over an error-free ideal multicast channel, while the gradient estimates, computed by each device using only the available local dataset, are transmitted to the PS over the fading uplink channel.

II System Model

We consider FL across $M$ wireless devices, each with its own local dataset, which employ DSGD with the help of a remote PS. We model the channel from the devices to the PS as a wireless fading MAC, and OFDM is employed for transmission. The system model is illustrated in Fig. 1. The parameter vector at iteration $t$ is denoted by $\boldsymbol{\theta}(t) \in \mathbb{R}^d$, and we assume that it is delivered from the PS to the devices over an error-free shared link. We denote the set of data samples available at device $m$ by $\mathcal{B}_m$, with $B_m \triangleq |\mathcal{B}_m|$, $m \in [M]$, and the stochastic gradient computed by device $m$ with respect to its local data samples by $\boldsymbol{g}_m(\boldsymbol{\theta}(t))$, $m \in [M]$. At the $t$-th iteration of the DSGD algorithm in (3), the local gradient estimates of the devices are sent to the PS over a wireless fading MAC using $N$ subchannels for a total of $s$ time slots, where $2sN \ge d$ (in practice, we typically have $N \ll d$). We denote the length-$N$ channel input vector transmitted by device $m$ at the $i$-th time slot of the $t$-th iteration of the DSGD by $\boldsymbol{x}_m^i(t) \in \mathbb{C}^N$. The channel output received by the PS at the $i$-th time slot of the $t$-th iteration, $\boldsymbol{y}^i(t) \in \mathbb{C}^N$, is given by

$\boldsymbol{y}^i(t) = \sum_{m=1}^{M} \boldsymbol{a}_m^i(t) \circ \boldsymbol{h}_m^i(t) \circ \boldsymbol{x}_m^i(t) + \boldsymbol{z}^i(t),$   (4)

where $\boldsymbol{a}_m^i(t)$ is the entry-wise scheduling vector with the $n$-th entry $a_{m,n}^i(t) = 1$, if device $m$ transmits over the $n$-th subchannel at the $i$-th time slot of iteration $t$, and $a_{m,n}^i(t) = 0$, otherwise, $\boldsymbol{h}_m^i(t) \in \mathbb{C}^N$ is the channel gains vector from device $m$ to the PS with the $n$-th entry $h_{m,n}^i(t)$ i.i.d. according to $\mathcal{CN}(0, \sigma_h^2)$, e.g., Rayleigh fading, and $\boldsymbol{z}^i(t)$ is a complex Gaussian noise vector with the $n$-th entry i.i.d. according to $\mathcal{CN}(0, \sigma_z^2)$. The channel input vector of device $m$ at the $i$-th time slot of iteration $t$, $\boldsymbol{x}_m^i(t)$, is a function of the channel gains $\boldsymbol{h}_m^i(t)$, the current parameter vector $\boldsymbol{\theta}(t)$, the local dataset $\mathcal{B}_m$, and the current gradient estimate at device $m$, $\boldsymbol{g}_m(\boldsymbol{\theta}(t))$, $m \in [M]$. We assume that, at each time slot, the CSI is known by the devices and the PS. For a total of $T$ iterations of the DSGD algorithm, the following total average transmit power constraint is imposed at device $m$, $m \in [M]$:

$\frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{s} \mathbb{E}\left[\left\|\boldsymbol{x}_m^i(t)\right\|_2^2\right] \le \bar{P},$   (5)

where the expectation is taken over the randomness of the channel gains.

The goal is to recover the average gradient $\frac{1}{M}\sum_{m=1}^{M} \boldsymbol{g}_m(\boldsymbol{\theta}(t))$ at the PS, which then updates the model parameter as in (3) after $s$ time slots. However, due to the pre-processing performed at each device and the distortion caused by the wireless channel, the PS uses a noisy estimate $\hat{\boldsymbol{g}}(t)$ to update the model parameter. Having defined $\boldsymbol{y}(t) \triangleq \big[\boldsymbol{y}^1(t), \ldots, \boldsymbol{y}^s(t)\big]$, we have $\hat{\boldsymbol{g}}(t) = \phi\big(\boldsymbol{y}(t)\big)$ for some update function $\phi$. The updated model parameter is then multicast to the devices by the PS through an error-free shared link, so the devices receive a consistent parameter vector for their computations in the next iteration.

We remark that the goal is to recover the average of the local gradient estimates of the devices at the PS, which is a distributed lossy computation problem over a noisy MAC. We will consider both a digital approach, based on separating computation and communication, and an analog transmission approach, where the gradients are transmitted simultaneously over the wireless MAC in an analog fashion, without being converted into bits first. Analog transmission has been well studied for image/video multicasting over wireless channels in recent years [9, 33, 27], and here we employ the random projection technique proposed in [27] for image transmission over a bandwidth-limited wireless channel.

III Digital DSGD

We first consider DSGD with digital transmission of the gradient estimates by the devices over the wireless fading MAC, referred to as the digital DSGD (D-DSGD) scheme. For D-DSGD, we consider $s = 1$, i.e., the parameter vector is updated after each time slot, and drop the dependency on the time slot index $i$.

The goal here is to schedule devices and employ power allocation across time slots such that the devices can transmit their local gradient estimates to the PS as accurately as possible. A possible approach is to schedule all the devices at all the iterations; however, due to the interference among the devices, this will result in each device sending a very coarse description of its local gradient estimate. Instead, here we will schedule the devices opportunistically according to their channel states.

In particular, with the knowledge of channel state information (CSI), at each iteration $t$, we select the device with the largest value of $\|\boldsymbol{h}_m(t)\|_2^2$, $m \in [M]$. Accordingly, the index of the transmitting device at iteration $t$ is given by:

$m^*(t) = \arg\max_{m \in [M]} \left\|\boldsymbol{h}_m(t)\right\|_2^2.$   (6)

We note that, due to the symmetry in the model, the probability of selecting any particular device at any iteration is the same, $1/M$. The power allocated to the scheduled device at the $t$-th iteration is given by $P(t) \triangleq \|\boldsymbol{x}_{m^*(t)}(t)\|_2^2$, where $\|\boldsymbol{x}_m(t)\|_2^2 = 0$, if $m \ne m^*(t)$, and it should satisfy

$\sum_{t=1}^{T} P(t) \le M T \bar{P}.$   (7)

For the rate of transmission, we will use a capacity upper bound. The $n$-th entry of the channel output at the $t$-th iteration, which is the result of the transmission from device $m^*(t)$, is given by

$y_n(t) = h_{m^*(t),n}(t)\, x_{m^*(t),n}(t) + z_n(t), \quad n \in [N],$   (8)

which is equivalent to a wireless fast fading channel with a limited number of channel uses, with CSI known at both the transmitter and the receiver. In the following, we provide an upper bound on the capacity of this channel by treating it as $N$ parallel Gaussian channels. This is equivalent to coding across infinitely many realizations of this $N$-dimensional channel. For a transmit power $P(t)$, the capacity of this parallel Gaussian channel is the result of the following optimization problem [29, Section 5.4.6]:

$C(t) = \max\limits_{\{P_n(t)\}:\, \sum_{n=1}^{N} P_n(t) \le P(t)} \; \sum_{n=1}^{N} \log_2\left(1 + \frac{P_n(t)\, |h_{m^*(t),n}(t)|^2}{\sigma_z^2}\right).$   (9)

The optimization problem in (9) is solved through waterfilling, and the optimal power allocation is given by

$P_n^*(t) = \left(\lambda - \frac{\sigma_z^2}{|h_{m^*(t),n}(t)|^2}\right)^+, \quad n \in [N],$   (10)

where $(x)^+ \triangleq \max\{x, 0\}$, and $\lambda$ is determined such that $\sum_{n=1}^{N} P_n^*(t) = P(t)$. Having calculated $\{P_n^*(t)\}_{n=1}^{N}$, the capacity of the wireless channel in (8) is given by

$C(t) = \sum_{n=1}^{N} \log_2\left(1 + \frac{P_n^*(t)\, |h_{m^*(t),n}(t)|^2}{\sigma_z^2}\right),$   (11)

which provides an upper bound on the capacity of the communication channel between device $m^*(t)$ and the PS. We would like to emphasize that this capacity upper bound can be quite loose, especially for small $N$ values.
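As an illustration of (9)-(11), the sketch below computes the waterfilling power allocation and the resulting capacity upper bound by bisecting on the water level; this is a standard textbook procedure, and the variable names are ours.

```python
import numpy as np

def waterfilling_capacity(h, P, noise_var=1.0):
    """Capacity upper bound (11) for N parallel Gaussian channels with complex
    gains h and total power P, via the waterfilling allocation (10)."""
    inv_snr = noise_var / np.abs(h) ** 2            # sigma_z^2 / |h_n|^2
    lo, hi = inv_snr.min(), inv_snr.min() + P + inv_snr.max()
    for _ in range(100):                            # bisection on the water level
        lam = 0.5 * (lo + hi)
        if np.maximum(lam - inv_snr, 0.0).sum() > P:
            hi = lam
        else:
            lo = lam
    P_n = np.maximum(lam - inv_snr, 0.0)            # optimal power per subchannel
    return np.sum(np.log2(1.0 + P_n / inv_snr))     # bits per time slot
```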

We adopt the D-DSGD scheme proposed in [17, Section III], in which the gradient estimate $\boldsymbol{g}_m(\boldsymbol{\theta}(t))$, computed at device $m$, is added to the error accumulated from previous iterations, denoted by $\boldsymbol{\Delta}_m(t)$, where we set $\boldsymbol{\Delta}_m(1) = \boldsymbol{0}$, $m \in [M]$. For the compression of the error-compensated gradient vector $\boldsymbol{g}_m^{ec}(t) \triangleq \boldsymbol{g}_m(\boldsymbol{\theta}(t)) + \boldsymbol{\Delta}_m(t)$, we employ the scheme in [7], where it is first sparsified by setting all but the $q(t)$ highest positive and the $q(t)$ smallest negative entries to zero (in practice, the goal is to have $q(t) \ll d$). Then, device $m$ computes the mean values of the positive and negative entries of the resultant sparse vector, denoted by $\mu_m^+(t)$ and $\mu_m^-(t)$, respectively, $m \in [M]$. If $\mu_m^+(t) > |\mu_m^-(t)|$, device $m$ sets all the negative entries of the sparse vector to zero and all the positive entries to $\mu_m^+(t)$, and vice versa, if $\mu_m^+(t) \le |\mu_m^-(t)|$, $m \in [M]$. Let $\hat{\boldsymbol{g}}_m(t)$ denote the resultant sparse vector at device $m$, $m \in [M]$. After computing $\hat{\boldsymbol{g}}_m(t)$ at device $m$, $m \in [M]$, the error accumulation vector, which maintains those entries of $\boldsymbol{g}_m^{ec}(t)$ that are not transmitted, is updated as follows:

$\boldsymbol{\Delta}_m(t+1) = \begin{cases} \boldsymbol{g}_m^{ec}(t) - \hat{\boldsymbol{g}}_m(t), & \text{if } m = m^*(t), \\ \boldsymbol{g}_m^{ec}(t), & \text{otherwise}. \end{cases}$   (12)

We note that, if device $m$ is scheduled, the accumulated error at device $m$ is the difference between $\boldsymbol{g}_m^{ec}(t)$ and its sparsified version $\hat{\boldsymbol{g}}_m(t)$; on the other hand, if device $m$ is not scheduled, we maintain the vector $\boldsymbol{g}_m^{ec}(t)$ entirely as the accumulated error. For a sparsity level $q(t)$, the D-DSGD scheme requires transmission of a total of [17, Equation (10)]

$r(t) = \log_2 \binom{d}{q(t)} + 33 \;\text{bits}.$   (13)

We assume that device $m^*(t)$ employs a capacity-achieving channel code operating at the capacity upper bound $C(t)$ in (11), and we set the sparsity level $q(t)$ as the highest integer satisfying $r(t) \le C(t)$.
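The following sketch shows our reading of this sparsify-and-binarize compressor from [7] (function and variable names are ours; it assumes the input has at least $q$ positive and $q$ negative entries).

```python
import numpy as np

def sparse_binary_compress(g, q):
    """Keep either the q largest positive or the q smallest negative entries
    of g, all replaced by their common mean, whichever mean is larger in
    magnitude; everything else is set to zero."""
    order = np.argsort(g)
    pos_idx, neg_idx = order[-q:], order[:q]     # candidate positive/negative sets
    mu_pos = np.mean(g[pos_idx])
    mu_neg = np.mean(g[neg_idx])
    g_hat = np.zeros_like(g)
    if mu_pos > abs(mu_neg):
        g_hat[pos_idx] = mu_pos                  # binarized positive part
    else:
        g_hat[neg_idx] = mu_neg                  # binarized negative part
    return g_hat                                 # g - g_hat feeds the error memory
```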

We highlight here that with the proposed D-DSGD algorithm, all the devices compute the gradient estimates based on the parameter vector received from the PS and their local datasets; however, only a single device is scheduled for transmission over the MAC with the scheduling policy given in (6). The PS updates the parameter vector after receiving the gradient estimate from the scheduled device and shares it with all the devices to continue their computations.

Remark 1.

An alternative device selection criterion, rather than the one in (6), is to select the device with the highest capacity upper bound. Here we do not employ this criterion due to the overhead introduced by solving the waterfilling power allocation for all the devices, which would be prohibitive for large $M$ and $N$ values.

Remark 2.

Instead of scheduling a single device at each iteration, we can schedule all or a subset of the devices at each iteration, and allocate distinct subchannels to different devices. In Section V we consider the so-called orthogonal digital DSGD (OD-DSGD) scheme, which schedules all the devices at each iteration, where each device is allocated a distinct subset of the subchannels. We have observed that OD-DSGD performs much worse than D-DSGD. It is worth noting that scheduling multiple devices reduces the number of subchannels allocated to each device for orthogonal transmission, and forces the devices to transmit their information at shorter blocklengths. In practice, this would result in a higher error probability or a reduced transmission rate [20]. An alternative approach is to code across time slots by allocating multiple time slots to a scheduled device. This requires knowledge of the future channel gains, which is not available in our model, since the channel gains are assumed to be i.i.d. across time slots and devices.

We will evaluate the performance of the D-DSGD scheme in Section V, and study in detail the impact of various system parameters, such as the average power constraint and the number of devices on the performance. We will also compare the D-DSGD scheme with other compression schemes in the literature, as well as the analog transmission of local gradients, which we present next.

IV Analog DSGD

Analog DSGD is motivated by the fact that the PS is only interested in the average of the gradient vectors, and the underlying wireless MAC can provide the sum of the gradients if they are sent in an uncoded fashion. We first present a generalization of the over-the-air computation approach introduced in [36], referred to as entry-wise scheduled analog DSGD (ESA-DSGD), and then extend it by introducing error accumulation, referred to as error compensated ESA-DSGD (ECESA-DSGD). Finally, we propose a novel analog scheme, built upon our previous work [17], referred to as compressed analog DSGD (CA-DSGD).

IV-A ESA-DSGD

With the ESA-DSGD scheme studied in [36], each device sends its gradient estimate entirely after applying power allocation, which is designed to satisfy the average power constraint. At the $t$-th iteration of the DSGD, device $m$, $m \in [M]$, transmits its local gradient estimate over $s$ time slots by utilizing both the real and imaginary components of the available subchannels. We define, for $i \in [s]$, $m \in [M]$,

$\boldsymbol{g}_m^{re,i}(t) \triangleq \big[\tilde{g}_{m,(i-1)N+1}(t), \ldots, \tilde{g}_{m,iN}(t)\big],$   (14a)
$\boldsymbol{g}_m^{im,i}(t) \triangleq \big[\tilde{g}_{m,sN+(i-1)N+1}(t), \ldots, \tilde{g}_{m,sN+iN}(t)\big],$   (14b)
$\boldsymbol{g}_m^{cx,i}(t) \triangleq \boldsymbol{g}_m^{re,i}(t) + \mathfrak{j}\, \boldsymbol{g}_m^{im,i}(t),$   (14c)

where $\tilde{g}_{m,n}(t)$ is the $n$-th entry of $\tilde{\boldsymbol{g}}_m(t)$, and we zero-pad $\boldsymbol{g}_m(\boldsymbol{\theta}(t))$ to have dimension $2sN$, the result being denoted by $\tilde{\boldsymbol{g}}_m(t)$. We note that, according to (14),

$\big[\boldsymbol{g}_m^{re,1}(t), \ldots, \boldsymbol{g}_m^{re,s}(t), \boldsymbol{g}_m^{im,1}(t), \ldots, \boldsymbol{g}_m^{im,s}(t)\big] = \tilde{\boldsymbol{g}}_m(t),$   (15)

where $\boldsymbol{g}_m^{re,i}(t) = \mathrm{Re}\{\boldsymbol{g}_m^{cx,i}(t)\}$ and $\boldsymbol{g}_m^{im,i}(t) = \mathrm{Im}\{\boldsymbol{g}_m^{cx,i}(t)\}$. At the $i$-th time slot of the $t$-th iteration, device $m$, $m \in [M]$, sends $\boldsymbol{x}_m^i(t) = \boldsymbol{\rho}_m^i(t) \circ \boldsymbol{g}_m^{cx,i}(t)$, where $\boldsymbol{\rho}_m^i(t) \in \mathbb{C}^N$ is the power allocation vector, which is set to satisfy the average transmit power constraint. Thus, after $s$ time slots, each device sends its gradient estimate of dimension $2sN$ entirely. The $n$-th entry of the power allocation vector is set as follows:

$\rho_{m,n}^i(t) = \begin{cases} \dfrac{\sqrt{\gamma_m(t)}}{h_{m,n}^i(t)}, & \text{if } |h_{m,n}^i(t)|^2 \ge g_{th}, \\ 0, & \text{otherwise}, \end{cases}$   (16)

for some $\gamma_m(t) > 0$ and threshold $g_{th} > 0$, set to satisfy the average transmit power constraint. According to (16), each entry of a gradient vector is transmitted only if its corresponding channel gain is above a threshold; a code sketch is given below. The set of devices selected to transmit the $n$-th entry of the channel input vector at the $i$-th time slot is given by, $n \in [N]$, $i \in [s]$,

$\mathcal{M}_n^i(t) \triangleq \left\{ m \in [M] : |h_{m,n}^i(t)|^2 \ge g_{th} \right\}.$   (17)
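In code, the power allocation in (16)-(17) amounts to truncated channel inversion; below is a minimal per-time-slot sketch under our reconstruction above (variable names are ours).

```python
import numpy as np

def esa_transmit(g_cx, h, gamma, g_th):
    """ESA-DSGD channel input for one time slot: invert the channel on
    subchannels whose gain exceeds the threshold, stay silent otherwise."""
    active = np.abs(h) ** 2 >= g_th                    # membership in (17)
    rho = np.where(active, np.sqrt(gamma) / h, 0.0)    # power allocation (16)
    return rho * g_cx                                  # channel input x = rho ∘ g^{cx}
```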

In the following, we analyze the average transmit power of the ESA-DSGD scheme based on the power allocation design given in (16). We set the parameters $\gamma_m(t)$ and $g_{th}$ to obtain the same average transmit power at device $m$, $m \in [M]$, in time slot $i$, $i \in [s]$, of iteration $t$, which satisfies

$\mathbb{E}\left[\left\|\boldsymbol{x}_m^i(t)\right\|_2^2\right] \le \bar{P}/s.$   (18)

According to (16), $\forall n \in [N]$, we have

$\mathbb{E}\left[\left|x_{m,n}^i(t)\right|^2\right] = \mathbb{E}\left[\left|\rho_{m,n}^i(t)\right|^2 \left|g_{m,n}^{cx,i}(t)\right|^2\right].$   (19)

We highlight that the entries of the gradient vector are independent of the channel gains $h_{m,n}^i(t)$, $\forall m, n, i, t$. Since the power allocation vector $\boldsymbol{\rho}_m^i(t)$ is a function of $\boldsymbol{h}_m^i(t)$, it follows that, for $n \in [N]$, $i \in [s]$,

$\mathbb{E}\left[\left|x_{m,n}^i(t)\right|^2\right] = \mathbb{E}\left[\left|\rho_{m,n}^i(t)\right|^2\right] \mathbb{E}\left[\left|g_{m,n}^{cx,i}(t)\right|^2\right].$   (20)

Note that $|h_{m,n}^i(t)|^2$ follows an exponential distribution with mean $\sigma_h^2$, $\forall m, n, i, t$. Thus, we have

$\mathbb{E}\left[\left|\rho_{m,n}^i(t)\right|^2\right] = \gamma_m(t)\, \mathbb{E}\left[\frac{\mathbb{1}\{|h_{m,n}^i(t)|^2 \ge g_{th}\}}{|h_{m,n}^i(t)|^2}\right] = \frac{\gamma_m(t)}{\sigma_h^2}\, E_1\!\left(\frac{g_{th}}{\sigma_h^2}\right),$   (21)

where $E_1(x) \triangleq \int_x^{\infty} \frac{e^{-u}}{u}\, du$ denotes the exponential integral. It follows that, $\forall i \in [s]$,

$\mathbb{E}\left[\left\|\boldsymbol{x}_m^i(t)\right\|_2^2\right] = \frac{\gamma_m(t)}{\sigma_h^2}\, E_1\!\left(\frac{g_{th}}{\sigma_h^2}\right) G_m^i(t),$   (22)

where we define $G_m^i(t) \triangleq \|\boldsymbol{g}_m^{cx,i}(t)\|_2^2$. Given the threshold value $g_{th}$, we set

$\gamma_m(t) = \frac{\bar{P}\, \sigma_h^2}{E_1\!\left(g_{th}/\sigma_h^2\right) \left\|\tilde{\boldsymbol{g}}_m(t)\right\|_2^2},$   (23)

which, we note, does not differ significantly across devices, since the values of $\|\tilde{\boldsymbol{g}}_m(t)\|_2^2$, $m \in [M]$, are not too different. We assume that, before transmitting, device $m$, $m \in [M]$, sends $\gamma_m(t)$ to the PS in an error-free fashion using an error correcting code, and the PS computes

$\bar{\gamma}(t) \triangleq \frac{1}{M} \sum_{m=1}^{M} \sqrt{\gamma_m(t)}.$   (24)

This factor will be used at the PS to scale down the received signal.

Here we analyze the received signal at the PS. By substituting $\boldsymbol{x}_m^i(t) = \boldsymbol{\rho}_m^i(t) \circ \boldsymbol{g}_m^{cx,i}(t)$ and (16) into (4), it follows that, for $n \in [N]$, $i \in [s]$,

$y_n^i(t) = \sum_{m \in \mathcal{M}_n^i(t)} \sqrt{\gamma_m(t)}\, g_{m,n}^{cx,i}(t) + z_n^i(t).$   (25)

The PS has perfect CSI, and hence knows the set $\mathcal{M}_n^i(t)$. Its goal is to recover the averages of $g_{m,n}^{re,i}(t)$ and $g_{m,n}^{im,i}(t)$ across the devices, which provide estimates for the corresponding entries of $\frac{1}{M}\sum_{m=1}^{M}\tilde{\boldsymbol{g}}_m(t)$. The PS estimates the $n$-th averaged complex entry, for $n \in [N]$, $i \in [s]$, using its noisy observation $y_n^i(t)$, given in (25), as

$\hat{g}_n^{cx,i}(t) = \begin{cases} \dfrac{y_n^i(t)}{\bar{\gamma}(t)\, |\mathcal{M}_n^i(t)|}, & \text{if } |\mathcal{M}_n^i(t)| \ge 1, \\ 0, & \text{otherwise}, \end{cases}$   (26)

and estimates the average gradient through

$\hat{\boldsymbol{g}}(t) = \big[\mathrm{Re}\{\hat{\boldsymbol{g}}^{cx,1}(t)\}, \ldots, \mathrm{Re}\{\hat{\boldsymbol{g}}^{cx,s}(t)\}, \mathrm{Im}\{\hat{\boldsymbol{g}}^{cx,1}(t)\}, \ldots, \mathrm{Im}\{\hat{\boldsymbol{g}}^{cx,s}(t)\}\big].$   (27)

The estimated vector $\hat{\boldsymbol{g}}(t)$ is then used to update the parameter vector as $\boldsymbol{\theta}(t+1) = \boldsymbol{\theta}(t) - \eta(t)\, \hat{\boldsymbol{g}}(t)$.
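On the receiver side, the estimation in (26)-(27) can be sketched as follows, under the same reconstruction: the PS rescales each received entry by the number of contributing devices and the average scaling factor, and leaves entries with no contributor at zero.

```python
import numpy as np

def esa_receive(y, H, g_th, gamma_bar):
    """Estimate the averaged complex gradient entries for one time slot.
    H has shape (M, N): channel gains from all M devices on N subchannels."""
    num_active = (np.abs(H) ** 2 >= g_th).sum(axis=0)        # |M_n^i(t)| per subchannel
    g_hat = np.zeros_like(y)
    sent = num_active > 0
    g_hat[sent] = y[sent] / (gamma_bar * num_active[sent])   # as in (26)
    return g_hat        # real/imaginary parts are reassembled as in (27)
```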

Remark 3.

We remark here that the scheme in [36] imposes a stricter average power constraint per iteration of the DSGD, i.e., at device $m$ we should have

$\sum_{i=1}^{s} \mathbb{E}\left[\left\|\boldsymbol{x}_m^i(t)\right\|_2^2\right] \le \bar{P}, \quad \forall t \in [T].$   (28)

For fairness in our comparisons we relax this power constraint, and impose the one in (5), which constrains the average power over all the iterations.

IV-B ECESA-DSGD

With the ESA-DSGD scheme, entries of the gradient vectors that are not sent due to poor channel conditions are completely forgotten. The proposed ECESA-DSGD scheme modifies ESA-DSGD by incorporating an error accumulation technique to retain the accuracy of the local gradients.

We denote the error accumulation vector calculated by device $m$ at the $i$-th time slot of the $t$-th iteration by $\boldsymbol{\Delta}_m^i(t)$, and set $\boldsymbol{\Delta}_m^i(0) = \boldsymbol{0}$, $\forall m, i$. Similarly to the ESA-DSGD scheme, with ECESA-DSGD, each device sends its entire gradient estimate of dimension $2sN$ through $s$ time slots, where the gradient estimates at the devices are zero-padded to dimension $2sN$. After computing $\boldsymbol{g}_m(\boldsymbol{\theta}(t))$ and obtaining $\boldsymbol{g}_m^{cx,i}(t)$ according to (14), device $m$, $m \in [M]$, updates its gradient estimate with the accumulated error as $\boldsymbol{g}_m^{ec,i}(t) = \boldsymbol{g}_m^{cx,i}(t) + \boldsymbol{\Delta}_m^i(t-1)$, for $i \in [s]$, and transmits vector $\boldsymbol{x}_m^i(t) = \boldsymbol{\rho}_m^i(t) \circ \boldsymbol{g}_m^{ec,i}(t)$, where $\boldsymbol{\rho}_m^i(t)$ is the power allocation vector, whose $n$-th entry is determined as follows:

$\rho_{m,n}^i(t) = \begin{cases} \dfrac{\sqrt{\gamma_m(t)}}{h_{m,n}^i(t)}, & \text{if } |h_{m,n}^i(t)|^2 \ge g_{th}, \\ 0, & \text{otherwise}, \end{cases}$   (29)

for some $\gamma_m(t) > 0$. Device $m$, $m \in [M]$, then updates the $n$-th entry of the error accumulation vector as follows:

$\Delta_{m,n}^i(t) = g_{m,n}^{ec,i}(t)\, \mathbb{1}\left\{ |h_{m,n}^i(t)|^2 < g_{th} \right\},$   (30)

where $\mathbb{1}\{\cdot\}$ is the indicator function, and $g_{m,n}^{ec,i}(t)$ denotes the $n$-th entry of $\boldsymbol{g}_m^{ec,i}(t)$, for $n \in [N]$, $i \in [s]$. Thus, the $n$-th entry of vector $\boldsymbol{g}_m^{ec,i}(t)$ is given by, for $n \in [N]$, $i \in [s]$,

$g_{m,n}^{ec,i}(t) = g_{m,n}^{cx,i}(t) + \Delta_{m,n}^i(t-1).$   (31)

According to (30), each entry of the gradient vector that is not transmitted, due to the power allocation given in (29), is retained in the error accumulation vector for possible transmission in the next iteration; a code sketch follows.
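A short sketch of this error-retention step, again under our reconstruction of (29)-(31) (variable names are ours): entries suppressed by the channel threshold are simply carried over to the next iteration.

```python
import numpy as np

def ecesa_step(g_cx, delta_prev, h, g_th):
    """ECESA-DSGD bookkeeping for one time slot of one device."""
    g_ec = g_cx + delta_prev                   # error compensation, as in (31)
    active = np.abs(h) ** 2 >= g_th            # entries that will be transmitted
    delta_next = np.where(active, 0.0, g_ec)   # retain suppressed entries, as in (30)
    return g_ec, delta_next
```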

Here we provide the power analysis of the ECESA-DSGD scheme. For fairness, we set the parameters $\gamma_m(t)$ and $g_{th}$ yielding an average transmit power at device $m$, $m \in [M]$, in time slot $i$, $i \in [s]$, of iteration $t$, satisfying the constraint in (18). Since the power allocation vector of ECESA-DSGD, given in (29), is similar to that of the ESA-DSGD, by following a similar procedure we obtain the following average power at device $m$ for ECESA-DSGD:

$\mathbb{E}\left[\left\|\boldsymbol{x}_m^i(t)\right\|_2^2\right] = \frac{\gamma_m(t)}{\sigma_h^2}\, E_1\!\left(\frac{g_{th}}{\sigma_h^2}\right) G_m^{ec,i}(t),$   (32)

where we define $G_m^{ec,i}(t) \triangleq \|\boldsymbol{g}_m^{ec,i}(t)\|_2^2$. For a fixed $g_{th}$, we set

$\gamma_m(t) = \frac{\bar{P}\, \sigma_h^2}{E_1\!\left(g_{th}/\sigma_h^2\right) \sum_{i=1}^{s} G_m^{ec,i}(t)},$   (33)

which is shared with the PS in an error-free manner, through which the PS computes

$\bar{\gamma}(t) \triangleq \frac{1}{M} \sum_{m=1}^{M} \sqrt{\gamma_m(t)}.$   (34)

From the power allocation in (29), it follows that, for $n \in [N]$, $i \in [s]$,

$y_n^i(t) = \sum_{m \in \mathcal{M}_n^i(t)} \sqrt{\gamma_m(t)}\, g_{m,n}^{ec,i}(t) + z_n^i(t),$   (35)

where we have

$\mathcal{M}_n^i(t) = \left\{ m \in [M] : |h_{m,n}^i(t)|^2 \ge g_{th} \right\}.$   (36)

Having perfect CSI, the PS's goal is to recover the average of the error-compensated entries $g_{m,n}^{ec,i}(t)$, the real and imaginary terms of which provide estimates for the corresponding entries of $\frac{1}{M}\sum_{m=1}^{M}\tilde{\boldsymbol{g}}_m(t)$, for $n \in [N]$, $i \in [s]$. The PS estimates the $n$-th averaged complex entry as

$\hat{g}_n^{cx,i}(t) = \begin{cases} \dfrac{y_n^i(t)}{\bar{\gamma}(t)\, |\mathcal{M}_n^i(t)|}, & \text{if } |\mathcal{M}_n^i(t)| \ge 1, \\ 0, & \text{otherwise}, \end{cases}$   (37)

and estimates the average gradient through

$\hat{\boldsymbol{g}}(t) = \big[\mathrm{Re}\{\hat{\boldsymbol{g}}^{cx,1}(t)\}, \ldots, \mathrm{Re}\{\hat{\boldsymbol{g}}^{cx,s}(t)\}, \mathrm{Im}\{\hat{\boldsymbol{g}}^{cx,1}(t)\}, \ldots, \mathrm{Im}\{\hat{\boldsymbol{g}}^{cx,s}(t)\}\big],$   (38)

for $n \in [N]$, $i \in [s]$. After $s$ time slots, the estimated vector $\hat{\boldsymbol{g}}(t)$ is then used to update the parameter vector as $\boldsymbol{\theta}(t+1) = \boldsymbol{\theta}(t) - \eta(t)\, \hat{\boldsymbol{g}}(t)$.

IV-C CA-DSGD

As opposed to ESA-DSGD and ECESA-DSGD, which aim to transmit all the gradient entries to the PS at each DSGD iteration, the CA-DSGD scheme proposed here reduces the required transmission bandwidth by reducing the dimension of the gradient vector through a linear projection. Each device projects its gradient estimate to dimension $\tilde{d} = 2\tilde{s}N$, which can then be transmitted through $\tilde{s}$ time slots, for some $\tilde{s} \le s$. The details of CA-DSGD are given in Algorithm 1.

We describe the CA-DSGD scheme for an arbitrary number of time slots $\tilde{s}$ per iteration of DSGD, which is determined later. At each iteration the devices sparsify their gradient estimates as described below. They employ error accumulation [22], where the accumulated error vector at device $m$ until iteration $t$ is denoted by $\boldsymbol{\Delta}_m(t)$, where we set $\boldsymbol{\Delta}_m(0) = \boldsymbol{0}$, $m \in [M]$. After computing $\boldsymbol{g}_m(\boldsymbol{\theta}(t))$, device $m$ updates its estimate with the accumulated error as $\boldsymbol{g}_m^{ec}(t) = \boldsymbol{g}_m(\boldsymbol{\theta}(t)) + \boldsymbol{\Delta}_m(t-1)$, $m \in [M]$. Next, the devices apply gradient sparsification, where device $m$ sets all but the $k$ elements with the highest magnitudes of vector $\boldsymbol{g}_m^{ec}(t)$ to zero, where $k$ is a design parameter, and obtains a sparse vector $\boldsymbol{g}_m^{sp}(t)$, $m \in [M]$. This $k$-level sparsification is represented by function $\mathrm{sp}_k$ in Algorithm 1, i.e., $\boldsymbol{g}_m^{sp}(t) = \mathrm{sp}_k\big(\boldsymbol{g}_m^{ec}(t)\big)$. Device $m$, $m \in [M]$, then updates $\boldsymbol{\Delta}_m(t)$ as $\boldsymbol{\Delta}_m(t) = \boldsymbol{g}_m^{ec}(t) - \boldsymbol{g}_m^{sp}(t)$. To transmit the sparse vectors over the limited-bandwidth channel, the devices employ a random projection matrix, similarly to compressive sensing; a code sketch of these device-side steps is given after Algorithm 1.

1: Input: $\bar{P}$, $\{\eta(t)\}_{t=1}^{T}$, $k$, $\boldsymbol{A}$
2: Initialize $\boldsymbol{\theta}(1) = \boldsymbol{0}$ and $\boldsymbol{\Delta}_m(0) = \boldsymbol{0}$, $m \in [M]$
3: for $t = 1, \ldots, T$ do
4:
  • devices do:

5:     for $m = 1, \ldots, M$ in parallel do
6:         Compute $\boldsymbol{g}_m(\boldsymbol{\theta}(t))$ with respect to local dataset $\mathcal{B}_m$
7:         $\boldsymbol{g}_m^{ec}(t) = \boldsymbol{g}_m(\boldsymbol{\theta}(t)) + \boldsymbol{\Delta}_m(t-1)$
8:         $\boldsymbol{g}_m^{sp}(t) = \mathrm{sp}_k\big(\boldsymbol{g}_m^{ec}(t)\big)$
9:         $\boldsymbol{\Delta}_m(t) = \boldsymbol{g}_m^{ec}(t) - \boldsymbol{g}_m^{sp}(t)$
10:         $\tilde{\boldsymbol{g}}_m(t) = \boldsymbol{A}\, \boldsymbol{g}_m^{sp}(t)$
11:         for $i = 1, \ldots, \tilde{s}$ do
12:             Transmit $\boldsymbol{x}_m^i(t)$ over the wireless MAC
13:         end for
14:     end for
15:
  • PS does:

16:     if $t < T$ then
17:         Estimate $\hat{\boldsymbol{g}}(t)$ from the received signal $\boldsymbol{y}(t)$
18:         $\boldsymbol{\theta}(t+1) = \boldsymbol{\theta}(t) - \eta(t)\, \hat{\boldsymbol{g}}(t)$, and multicast it to the devices
19:     else
20:         $\boldsymbol{\theta}(T+1) = \boldsymbol{\theta}(T) - \eta(T)\, \hat{\boldsymbol{g}}(T)$
21:     end if
22: end for
Algorithm 1 CA-DSGD
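The device-side steps of Algorithm 1 can be sketched as follows (a minimal NumPy illustration of error compensation, top-$k$ sparsification, and random projection; the seed sharing for the projection matrix and the PS-side recovery are omitted, and all names are ours).

```python
import numpy as np

def ca_dsgd_device(g, delta_prev, A, k):
    """CA-DSGD device side: error-compensate, keep the k largest-magnitude
    entries, update the error memory, and project to a low dimension."""
    g_ec = g + delta_prev                          # error compensation
    top = np.argsort(np.abs(g_ec))[-k:]            # indices kept by sp_k(.)
    g_sp = np.zeros_like(g_ec)
    g_sp[top] = g_ec[top]                          # k-level sparsification
    delta_next = g_ec - g_sp                       # error memory for next iteration
    return A @ g_sp, delta_next                    # low-dimensional vector to transmit

# The PS and devices generate the same pseudo-random A from a shared seed, e.g.:
# rng = np.random.default_rng(0); A = rng.normal(0.0, 1.0, (d_tilde, d))
```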

A pseudo-random matrix $\boldsymbol{A} \in \mathbb{R}^{\tilde{d} \times d}$, with each entry i.i.d. according to $\mathcal{N}(0, 1/\tilde{d})$, is generated and shared between the PS and the devices, where $\tilde{d} = 2\tilde{s}N$, for an arbitrary $\tilde{s} \le s$. At each iteration $t$, device $m$ computes $\tilde{\boldsymbol{g}}_m(t) = \boldsymbol{A}\, \boldsymbol{g}_m^{sp}(t)$, and aims to transmit it to the PS over $\tilde{s}$ time slots. We define, for $i \in [\tilde{s}]$, $m \in [M]$,

$\tilde{\boldsymbol{g}}_m^{re,i}(t) \triangleq \big[\tilde{g}_{m,(i-1)N+1}(t), \ldots, \tilde{g}_{m,iN}(t)\big],$   (39a)