Rate Region for Indirect Multiterminal Source Coding in Federated Learning


Abstract

One of the main focuses in federated learning (FL) is communication efficiency, since a large number of participating edge devices send their updates to the edge server at each round of model training. Existing works reconstruct each model update from the edge devices and implicitly assume that the local model updates are independent across edge devices. In FL, however, the model update transmission is an indirect multiterminal source coding problem: each edge device cannot directly observe the source that is to be reconstructed at the decoder, but is provided only with a noisy version of it. The existing works therefore do not leverage the redundancy in the information transmitted by different edge devices. This paper studies the rate region for the indirect multiterminal source coding problem in FL. The goal is to obtain the minimum achievable rate at a particular upper bound on the gradient variance. We obtain the rate region for multiple edge devices in the general case and derive an explicit formula for the sum-rate-distortion function in the special case where the gradients are identically distributed across edge devices and dimensions. Finally, we analyze the communication efficiency of convex and non-convex Minibatched SGD based on the sum-rate-distortion function, respectively.

Federated learning, multiterminal rate-distortion theory, quadratic vector Gaussian CEO problem, communication efficiency.

I Introduction

Federated learning (FL) [22, 16, 7, 37] is a new edge learning framework that enables many edge devices to collaboratively train a machine learning model without exchanging datasets under the coordination of an edge server. In FL, each edge device downloads a shared model from the edge server, computes an update to the current model by learning from its local dataset, then sends this update to the edge server. Therein, the updates are averaged to improve the shared model. Compared with traditional learning at a centralized data center, FL offers several distinct advantages, such as preserving privacy, reducing network congestion, and leveraging distributed on-device computation. FL has recently attracted significant attention from both academia and industry, such as [4, 21, 18, 31].

A main focus in this research area is communication-efficient FL, i.e., achieving a better convergence rate (higher model accuracy) at lower communication cost. Works on SGD convergence analysis [8, 13] show that the convergence rate mainly depends on the variance bound of the gradient and on the number of updates. The communication cost depends on the communication cost per round and on the number of communication rounds. Fig. 1 shows three types of techniques that improve communication efficiency in FL: aggregation frequency control, compression schemes, and user scheduling.

Fig. 1: Illustration of technologies that improve communication efficiency in FL.

1) Aggregation Frequency Control: Several recent methods improve communication efficiency in federated settings by allowing a variable aggregation frequency to be applied at each edge device in parallel at each communication round. Reducing the aggregation frequency reduces the communication cost by decreasing the number of communication rounds, but increases the gradient variance, which degrades convergence. Hence, the control of the aggregation frequency significantly influences communication efficiency. The work [22] proposed a framework named "FedAvg" that updates the local model at each device with a given number of SGD iterations followed by model averaging. The work [14] studied distributed machine learning across multiple datacenters in different geographical locations and proposed a threshold-based approach to reduce the communication among datacenters. The work [34] proposes a control algorithm that determines the best tradeoff between local updates and global parameter aggregation under a given resource budget.

2) Model Compression: While aggregation frequency control can reduce the number of communication rounds, model compression schemes such as sparsification and quantization can significantly reduce the communication bits per round, at the cost of increased gradient variance. Partially initiated by a 1-bit implementation of SGD by Microsoft [29], a large number of recent studies have revisited the idea of low-precision training as a means to reduce communication [2, 38, 35, 27, 6]. Other approaches for low-precision training focus on sparsification of gradients, either by thresholding small entries or by random sampling [17, 1, 19, 32]. Several approaches, including QSGD [2] and TernGrad [35], implicitly combine quantization and sparsification to maximize communication efficiency.

3) User Scheduling: In federated learning, edge devices may differ in computation capacity, communication quality and data distribution. The goal of user scheduling is to minimize the communication cost per round while transmitting the bits required by quantization, by selecting a subset of devices to upload their model updates in each round. A host of studies propose various scheduling policies for FL, ranging from minimizing the transmission latency [24] and local-model-importance-aware user scheduling [20, 3], to accounting for both staleness and communication quality to improve communication efficiency [36].

All existing research aims to reconstruct each model update from the edge devices and implicitly assumes that the local model updates are independent across edge devices. For example, the model compression schemes quantize and/or compress the models or stochastic gradients at each individual edge device, and the edge server recovers the exact value of the stochastic gradient of each individual device. The user scheduling and aggregation frequency control works also assume that the communication cost increases linearly with the number of edge devices. However, we observe that the main objective in FL is a good estimate of the model update at the edge server using the information received from the edge devices, rather than the exact recovery of each model update from each device. This is an indirect multiterminal source coding problem, where each edge device cannot directly observe the source that is to be reconstructed at the decoder, but is provided only with a noisy version of it. Specifically, in FL the objective is to estimate the global model update computed by gradient descent (GD) on the global dataset, while the local model update computed by each edge device is a noisy version of the global model update. Furthermore, the local gradients are highly correlated among different workers, providing an opportunity for correlated source coding. Hence, existing works do not leverage the redundancy in the information transmitted by different edge devices.

In addition, although model compression schemes provide a tradeoff between the variance of the received gradients at the edge server and the achievable number of communication bits, the achievable communication bits they report are inconsistent with one another. It is hard to combine existing model compression schemes with communication-efficient FL techniques such as aggregation frequency control and user scheduling. Therefore, the transmission model of communication-efficient FL urgently requires a theoretical limit on the communication bits at a particular gradient variance, instead of simply assuming that the gradient parameters are transmitted perfectly with a constant number of communication bits.

Motivated by the above issues, in this paper we derive the rate region for the indirect multiterminal source coding problem in FL, referred to as the quadratic vector Gaussian CEO problem. Our goal is to obtain the minimum achievable rate at a particular upper bound on the gradient variance. We formulate the indirect multiterminal source coding problem based on observations of the gradient distribution and solve it from the standpoint of multiterminal rate-distortion theory. Our result can be regarded as a tool to analyze the communication efficiency of a given FL system. The main contributions of this work are outlined below:

  • Gradient distribution model: We make assumptions on the distributions of the global gradient and the local gradients in federated learning, and justify them with experimental results on the MNIST dataset. Based on a thorough understanding of the gradient distributions, we show that the multiterminal source coding problem in FL is a quadratic vector Gaussian CEO problem.

  • Rate region results: We derive the rate region of the quadratic vector Gaussian CEO problem with multiple edge devices and show that the Berger-Tung achievable region is indeed tight. Our converse proof is inspired by the converse in [25]; unlike there, however, the distortion in our problem is the variance of the estimator instead of the MSE.

  • Explicit rate-distortion function: We derive an explicit formula for the rate-distortion function in the special case where the gradients are identically distributed across edge devices and dimensions. The derived function is a sum of two nonnegative functions: one is the classical rate-distortion function of a single Gaussian source, and the other is a new rate-distortion function that dominates the performance of the system at relatively small distortion.

  • Communication efficiency analysis: We analyze the communication efficiency of convex and non-convex Minibatched SGD based on the sum-rate-distortion function, respectively, and provide an inherent trade-off between communication cost and convergence guarantees.

The rest of this paper is organized as follows. The FL system is modeled in Section II. Section III formulates the multiterminal source coding problem in FL. Section IV gives the rate region results. Section V proves the direct and converse parts of the rate region results. Section VI derives the explicit rate region. Section VII analyzes the communication efficiency in FL. Finally, we conclude the paper in Section VIII.

II Federated Learning

There are many variants of FL, with different preconditions and convergence guarantees. Our rate region results are broadly applicable: they can be applied to any FL setting whose local gradient distribution satisfies our Gaussian assumptions. To facilitate the presentation, we focus on a basic FL setup; however, the results extend to any case whose gradient distribution satisfies our assumptions. In this section, we introduce the system model and the convergence rate of federated learning under error-free communication.

II-A System Model

Fig. 2: Illustration of federated learning system in error-free communication.

We consider an FL framework as illustrated in Fig. 2, where a shared AI model (e.g., a classifier) is trained collaboratively across edge devices via the coordination of an edge server. Let denote the set of edge devices. Each device collects a fraction of the labelled training data via interaction with its own users, constituting a local dataset, denoted as . Let denote the -dimensional model parameter to be learned. The loss function measuring the model error is defined as

(1)

where is the loss function of device quantifying the prediction error of the model on the local dataset collected at the -th device, with being the sample-wise loss function, and is the union of all datasets. Let denote the gradient vector calculated through the gradient descent (GD) algorithm at iteration . The minimization of is typically carried out through the Minibatched stochastic gradient descent (Minibatched SGD) algorithm, where device ’s local dataset is split into mini-batches of size and at each iteration , we draw one mini-batch randomly and calculate the local gradient vector as

(2)

When the mini-batch size is , the Minibatched SGD algorithm reduces to the SGD algorithm. In this case, we say the local gradient has variance at iteration , i.e., . In the general case, the local gradient has variance at iteration . If all local gradients are available at the edge server through error-free transmission, the optimal estimator of is the sample-mean estimator, i.e., , where is the global batch size. It is not hard to see that the variance of this optimal estimator is . The edge server then updates the model parameter as

(3)

with being the learning rate at iteration .
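To make the update pipeline concrete, the following minimal Python sketch simulates the error-free setting described above on a toy linear-regression loss: each device computes a mini-batch gradient, the server forms the sample-mean estimator, and the model is updated with a constant learning rate. All variable names and sizes (K, d, b, eta, etc.) are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: K devices, d-dimensional model, linear-regression loss per device.
K, d, n_local, b = 4, 5, 200, 20          # devices, dimension, samples/device, mini-batch size
w_true = rng.normal(size=d)

datasets = []
for _ in range(K):
    A = rng.normal(size=(n_local, d))
    y = A @ w_true + 0.1 * rng.normal(size=n_local)
    datasets.append((A, y))

def local_gradient(w, A, y, batch):
    """Mini-batch gradient of the squared loss at one device."""
    Ab, yb = A[batch], y[batch]
    return Ab.T @ (Ab @ w - yb) / len(batch)

w = np.zeros(d)
eta = 0.05                                 # constant learning rate
for t in range(100):
    grads = []
    for A, y in datasets:
        batch = rng.choice(n_local, size=b, replace=False)
        grads.append(local_gradient(w, A, y, batch))
    g_hat = np.mean(grads, axis=0)         # sample-mean estimator of the global gradient
    w = w - eta * g_hat                    # error-free model update
print("final parameter error:", np.linalg.norm(w - w_true))
```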

II-B Convergence Rate

Given access to local gradients and a starting point , Minibatched SGD builds iterates given by Equation (3), projected onto , where is a sequence of learning rates. In this setting, one can show:

Theorem 1 ([8], Theorem 6.3)

Let be unknown, convex and -smooth. Let be given, and let . Let be fixed. Given repeated, independent access to the estimator of gradients with variance bound for loss function , i.e., for all , training with initial point and constant step sizes , where , achieves

(4)
Remark 1

When the gradient vector is calculated through the gradient descent (GD) algorithm and the model is updated with the gradient sequence under error-free transmission, the variance bound is , the convergence rate achieves its best value , and thus the edge server is interested in this sequence. When the gradient vector is calculated through the Minibatched SGD algorithm and the model is updated with the unbiased estimator under error-free transmission, the variance bound is given the global batch size . However, error-free transmission is infeasible in practice due to communication resource limitations. The local gradients have to be quantized, and the unbiased estimation can only be based on these quantized values. Hence, the variance of the unbiased estimator based on quantized gradients must be larger than at each iteration .

In general, the convergence rate increases with the variance bound of the gradient estimator for a given number of iterations, while the communication bits per iteration decrease with the variance bound of the gradient estimator. To obtain the trade-off between the convergence rate and the communication cost, we need to find the minimum achievable rate at a particular variance upper bound. This is a basic problem in rate-distortion theory.
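As a point of reference, the single-source counterpart of this question is the classical Gaussian rate-distortion function; the multiterminal, indirect version studied in this paper generalizes it. The standard formula (a well-known result, stated here only as background) reads:

```latex
% Classical single-source benchmark: X ~ N(0, \sigma^2) under MSE distortion D.
\begin{equation*}
  R(D) \;=\; \frac{1}{2}\log_2\frac{\sigma^2}{D}, \qquad 0 < D \le \sigma^2,
\end{equation*}
% i.e., every halving of the tolerated distortion costs an extra half bit per sample.
```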

III Multiterminal Source Coding Problem in FL

In this section, in order to accurately formulate the indirect multiterminal source coding problem in federated learning, we first study the distributions of the global and local gradients in FL. We then formulate a quadratic vector Gaussian CEO problem based on a thorough understanding of the gradient distributions.

III-A Gradient Distribution

Global Gradient

Recall that the edge server is interested in the sequence , the global gradient vector sequence calculated through the gradient descent (GD) algorithm. The following are key assumptions on the distribution of :

  • The gradient is an independently distributed Gaussian vector sequence with zero mean. The Gaussian assumption is valid since the field of probabilistic modeling commonly uses Gaussian distributions to model gradients [30, 15, 12, 23]. The independence assumption is valid when the learning rate is large enough.

  • The gradients are independent across gradient-vector dimensions . This assumption is valid as long as the features in a data sample are independent but non-identically distributed, which is typically the case. Even if the gradients are strongly correlated across dimensions, they can be de-correlated by regularization methods such as sparsity-inducing regularization [10, 28] and parameter sharing/tying [11, 9].

We perform experiments on the MNIST dataset to justify the assumptions on the global gradients . We evenly sampled 25000 gradients over iterations [1, 10].
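A minimal sketch of the kind of empirical check used here is given below (it is not the paper's experiment code): it runs a normality test on each gradient coordinate and computes the pairwise correlation coefficients across dimensions. The array `grads` is a placeholder standing in for logged gradient samples.

```python
import numpy as np
from scipy import stats

# Hypothetical check of the Gaussianity/independence assumptions on sampled
# gradients; `grads` stands for a (num_samples, num_dims) array of gradient
# coordinates collected during training (placeholder data used here).
rng = np.random.default_rng(1)
grads = rng.normal(scale=0.01, size=(25000, 6))   # stand-in for logged gradients

for j in range(grads.shape[1]):
    stat, p = stats.normaltest(grads[:, j])       # D'Agostino-Pearson normality test
    print(f"dim {j}: mean={grads[:, j].mean():.5f}, normality p-value={p:.3f}")

# Pairwise correlation coefficients across dimensions (near zero if the
# coordinates are approximately independent).
corr = np.corrcoef(grads, rowvar=False)
print("max off-diagonal |corr|:", np.max(np.abs(corr - np.eye(corr.shape[0]))))
```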

(a) The distribution of global gradient at different dimension.
(b) The distribution of global gradient at different iteration.
Fig. 3: The distribution of global gradient.

Fig. 3 illustrates the experimental results on the distribution of the global gradient for the MNIST dataset. It is observed that the global gradient follows a Gaussian distribution with zero mean. Fig. 3(a) shows 6 dimensions from two layers of the global gradient at iteration 10. The variances of the global gradient within the same layer are similar, while the variances across different layers differ. Fig. 3(b) shows one dimension of the global gradient over iterations [1, 10]. It is observed that the variance of the global gradient gradually decreases as the model converges.

(a) Correlation coefficient of global gradient over dimension ’s.
(b) Correlation coefficient of global gradient over iteration ’s.
Fig. 4: Correlation coefficient of global gradient.

The correlation coefficient of Gaussian variables indicates their independence. Fig. 4 shows the correlation coefficients of the global gradient across iterations and dimensions . Fig. 4(a) shows the correlation coefficients across dimensions: the correlation coefficient of gradients on two different dimensions is almost zero, so the global gradient sequence is almost independent across dimensions. Fig. 4(b) shows the correlation coefficients across iterations: the correlation coefficient of gradients at two adjacent iterations is large at the beginning of training and becomes almost zero as training progresses, so the global gradient sequence is almost independent across iterations.

Local Gradients

For , edge device carries out Minibatched SGD in FL. Recall that is the local gradient vector sequence calculated through the Minibatched SGD algorithm at device . The local gradient vector sequence can be viewed as a noisy version of corrupted by additive noise, i.e., . The following are key assumptions on the distribution of for :

  • The gradient noise consists of Gaussian random vectors independent of the process. The mean of is zero in the IID data setting and non-zero in the non-IID data setting. Note that the gradient noise depends on the selection of the local batch at edge device . The independence assumption is valid since the selection of the local batch is independent of the global gradient .

  • The gradient noise is independent and non-identically distributed across devices . This assumption is valid as long as the selection of the local batches is independent and non-identical across edge devices.

  • The gradient noise is independent and non-identically distributed across dimensions . The reasoning is similar to that for the gradient .

We perform experiments to justify the assumption that the gradient noises are Gaussian distributed and independent across iterations , devices and dimensions .
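The following synthetic Python snippet illustrates the assumed noise model, with the local gradient written (in assumed notation) as the global gradient plus device-specific Gaussian noise; it checks that the noise has zero mean in the IID setting and is uncorrelated across devices. The noise standard deviations are arbitrary illustrative values, not measured quantities.

```python
import numpy as np

# Synthetic illustration of the local-gradient model Y_k = X + N_k (assumed
# notation): X is a global-gradient coordinate, N_k the device-k noise.
rng = np.random.default_rng(2)
K, n = 3, 100_000
sigma_x = 1.0
sigma_n = np.array([0.5, 0.8, 1.2])                  # per-device noise std (illustrative)

X = rng.normal(0.0, sigma_x, size=n)
N = rng.normal(0.0, sigma_n[:, None], size=(K, n))   # independent across devices
Y = X + N                                            # local (noisy) gradients

print("E[Y_k - X] per device:", (Y - X).mean(axis=1))      # ~0 in the IID setting
print("noise corr, devices 0 and 1:",
      np.corrcoef(N[0], N[1])[0, 1])                        # ~0 (independence)
```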

(a) IID data setting.
(b) Non-IID data setting.
Fig. 5: The conditional distribution of the local gradient given the global gradient in the IID and non-IID data settings, where the blue dot represents the given global gradient .

Fig. 5 illustrates the experimental results on the conditional distribution of the local gradient given the global gradient in the IID and non-IID data settings. We sampled 10000 local gradients within one iteration from one edge device with an IID dataset and one edge device with a non-IID dataset, respectively. It is observed that the conditional local gradient given follows a Gaussian distribution, i.e., the gradient noise follows a Gaussian distribution. From Fig. 5(a), it is observed that the conditional expectation of the local gradient is equal to the global gradient, i.e.,

(5)

Hence, we have justified that the mean of the noise is zero in the IID data setting. From Fig. 5(b), it is observed that ; similarly, we have justified that the mean of the noise is non-zero in the non-IID data setting.

(a) Correlation coefficient of noise over iteration ’s.
(b) Correlation coefficient of noise over device ’s.
(c) Correlation coefficient of noise over dimension ’s.
Fig. 6: Correlation coefficient of gradient noise over iteration, device and dimension.

Fig. 6 illustrates the correlation coefficients of the gradient noise across iterations , devices and dimensions . We sample 1000 realizations of each gradient noise for and . We randomly select 10 dimensions of the gradient noise to show the correlation coefficients across dimensions . It is observed that the correlation coefficients of the gradient noise are almost zero across dimensions, devices and iterations. Hence, the noise sequence is independent across dimensions, devices and iterations.

III-B Problem Formulation

In this subsection, we formulate the quadratic vector Gaussian CEO problem based on the observed distributions of the global gradient and the local gradients . Let the global gradient be an independent Gaussian vector sequence with mean 0 and variance . Each takes values in the real space . For , let the local gradient be a noisy version of , each taking values in the real space and corrupted by independent additive white Gaussian noise, i.e.,

(6)

where are Gaussian random vectors independent across devices , dimensions and iterations . For , and , we assume that is a centered Gaussian variable with mean 0 and variance .

Fig. 7: The Gaussian vector CEO problem in FL.

Fig. 7 shows the CEO problem in FL. The edge server (CEO) is interested in the sequence , which it cannot observe directly. The edge server employs a team of edge devices (agents) who observe independently corrupted versions of . We write independent copies of and as and , respectively. To facilitate the following derivation, we omit the iteration index . For , each local gradient sequence observed by edge device is separately encoded to , and the encoded messages are sent to the information processing center, where the edge server observes and outputs an estimate of using the decoder function . The encoder functions are defined by

(7)

and satisfy the total rate constraint

(8)

We write an -tuple of encoder functions as

(9)

Similarly, we write

(10)

The decoder function is defined by

(11)

For , define the average mean squared error (MSE) distortion by

(12)

For a target distortion , a rate -tuple is said to be achievable if there are encoders satisfying (8) and a decoder such that is an unbiased estimator of , i.e., , and for some . The closure of the set of all achievable rate -tuples is called the rate region, which we denote by . Our aim is to characterize the region in an explicit form.
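For readability, one standard way of writing the coding definitions referenced in (7)-(12) is sketched below; the symbols (block length n, per-device rates R_k, encoders \varphi_k, decoder \psi) are our assumed notation and may differ from the paper's original symbols.

```latex
% Assumed notation for the standard CEO coding definitions (per dimension,
% dimension index suppressed):
\begin{align*}
  \varphi_k &: \mathbb{R}^{n} \to \{1,\dots,2^{nR_k}\}, \quad k=1,\dots,K
      && \text{(encoders)}\\
  \sum_{k=1}^{K} R_k &\le R
      && \text{(total rate constraint)}\\
  \psi &: \prod_{k=1}^{K}\{1,\dots,2^{nR_k}\} \to \mathbb{R}^{n}
      && \text{(decoder)}\\
  D &= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[(X_i-\hat{X}_i)^2\big]
      && \text{(average MSE distortion of the unbiased estimate)}
\end{align*}
```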

Remark 2

This problem differs from the existing quadratic Gaussian CEO works [25, 26, 33] in the following aspects:

  • The estimator in our CEO problem must be unbiased, and the distortion is the variance of the estimator, whereas the existing quadratic vector Gaussian CEO problem uses the MSE as distortion. We choose this distortion function because the MSE cannot guarantee the convergence of federated learning. Recall that Theorem 1 requires the gradient estimator to be unbiased. Intuitively, the effective information and the noise in the gradient estimator are enlarged or reduced in proportion to the learning rate , and the learning rate can be adjusted. Hence, the convergence rate depends on the ratio between the effective information and the noise in the gradient estimator, not on the absolute noise in the gradient estimator. The MSE measures only the absolute noise in the gradient estimator; it cannot measure the effective information in the gradient estimator. Our unbiased gradient estimator fixes the effective information in the gradient estimator, and the variance measures the noise.

  • Second, existing works assume that the sequence length is large enough when deriving the achievability of the rate region. However, the local gradient sequences cannot be jointly encoded across iterations, i.e., the sequence length is in FL, because the local gradient must be transmitted to the edge server before the next local gradient is available. Therefore, the rate-distortion results may not be achievable in practice and are actually a lower bound on the achievable rate; see Theorem 3 in Section VII.

IV Rate Region Results

Let be the Berger-Tung achievable region using Gaussian auxiliary random variables for the CEO problem in FL. We show that

(13)
(14)
(15)
(16)

where

(17)

and

(18)

where is the variance of global gradient and is the variance of the gradient noise .

Our main result is

Theorem 2
(19)
Remark 3

As will become clear in the next section, the parameter can be interpreted as the distortion of the gradient estimator at dimension , can be interpreted as the rate the -th edge device spends on quantizing its observation noise at dimension , and can be interpreted as the rate contributed by edge device on dimension . The set contains all feasible such that the distortion at dimension satisfies . The set is empty when and is empty when . Moreover, and contain more elements as and increase, respectively. Given , the set can be interpreted as the sub-rate region of dimension . The -th sub-rate region contains fewer elements as increases. For every , there is a corresponding element in the rate region satisfying . Since must be convex, we conclude that is also convex. This can also be inferred directly by noting that, given the distortion , the rate region is a convex set.

The converse and direct parts of the proof of this theorem are given in Section V. We prove the theorem from the standpoint of multiterminal rate-distortion theory.

V Proof

V-A Achievability of Theorem 2

The achievability proof is based on the Berger-Tung inner bound developed by Berger [5].

Lemma 1 (Berger-Tung inner bound)

If we can find auxiliary random variables such that

(20)

and decoding function such that

(21)

then the following rate region is achievable

(22)

Now let us consider the Berger-Tung scheme in FL. For each edge device, we define the auxiliary random variable , where the are independently distributed and independent of and . The parameters are determined by the target distortion . After recovering , the decoder reconstructs by applying the following weighted-averaging function component-wise:

(23)

where . Note that is an unbiased estimator of and the variance of is . We set this variance equal to the target distortion , which gives

(24)

where

(25)

Let us define

(26)

We can interpret as the rate the -th device spends on quantizing the -th dimension of its observation noise. We will use the ’s as parameters instead of the . Note that for any choice of we can find a corresponding and, therefore, a set of auxiliary random variables. We can then rewrite (25) in terms of the ’s as

(27)

as desired.
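A quick Monte-Carlo sanity check of this construction is sketched below (assumed notation; not from the paper): each auxiliary variable is the noisy observation plus independent Gaussian "quantization" noise of variance q_k^2, the decoder is the unbiased weighted average with weights proportional to 1/(sigma_k^2 + q_k^2), and the empirical variance of the estimate matches 1 / sum_k 1/(sigma_k^2 + q_k^2).

```python
import numpy as np

# Monte-Carlo check of the Berger-Tung test channel (assumed notation):
# U_k = X + N_k + V_k, with V_k the test-channel ("quantization") noise of
# variance q_k^2, and an unbiased weighted-average decoder.
rng = np.random.default_rng(3)
n = 200_000
sigma_x = 1.0
sigma_n = np.array([0.6, 0.9, 1.3])      # observation-noise std per device (illustrative)
q = np.array([0.4, 0.5, 0.7])            # test-channel noise std (illustrative)

X = rng.normal(0.0, sigma_x, size=n)
U = X + rng.normal(0.0, sigma_n[:, None], size=(3, n)) \
      + rng.normal(0.0, q[:, None], size=(3, n))

# Unbiased weights: proportional to 1/(sigma_k^2 + q_k^2), normalized to sum to 1.
inv_var = 1.0 / (sigma_n**2 + q**2)
w = inv_var / inv_var.sum()
X_hat = w @ U

print("bias              :", np.mean(X_hat - X))   # ~0 (unbiased)
print("empirical distortion:", np.var(X_hat - X))  # ~ 1 / sum_k 1/(sigma_k^2 + q_k^2)
print("closed-form value   :", 1.0 / inv_var.sum())
```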

From the Berger-Tung inner bound, the rates are achievable if, for all non-empty sets ,

(28)
(29)
(30)
(31)
(32)
(33)
(34)
(35)

Equations (27) and (35) together conclude the achievability proof.

V-B Converse of Theorem 2

Our converse proof is inspired by Oohama’s converse in [25]. Suppose we achieve distortion , where . Let denote all the messages produced by the edge devices after observing an -block. Let us define

(36)

For any ,

(37)
(38)
(39)
(40)
(41)
(42)
(43)
(44)

where (38) follows from the fact that conditioning reduces entropy, (40) follows from the fact that form a Markov chain, (41) follows from the chain rule for mutual information, (42) follows from the fact that form a Markov chain, and (43) follows from the chain rule for mutual information and the fact that form a Markov chain.

Lemma 2

Let be any unbiased estimator of with distortion . We have

(45)
{proof}
(46)
(47)
(48)
(49)

The conditional variance of variable given the unbiased estimator is given by

(50)
(51)
(52)
(53)
(54)
(55)
(56)

where (54) and (55) follow from the fact that . Substituting (56) into (49), we have

(57)
(58)
(59)

where (58) follows from Jensen's inequality. This completes the proof of Lemma 2. Based on Lemma 2, we have a simple lower bound for the first term:

(60)