Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor
In order to mitigate the high communication cost in distributed and federated learning, various vector compression schemes, such as quantization, sparsification and dithering, have become very popular. In designing a compression method, one aims to communicate as few bits as possible, which minimizes the cost per communication round, while at the same time attempting to impart as little distortion (variance) to the communicated messages as possible, which minimizes the adverse effect of the compression on the overall number of communication rounds. However, intuitively, these two goals are fundamentally in conflict: the more compression we allow, the more distorted the messages become. We formalize this intuition and prove an uncertainty principle for randomized compression operators, thus quantifying this limitation mathematically, and effectively providing lower bounds on what might be achievable with communication compression. Motivated by these developments, we call for the search for the optimal compression operator. In an attempt to take a first step in this direction, we construct a new unbiased compression method inspired by the Kashin representation of vectors, which we call Kashin compression (KC). In contrast to all previously proposed compression mechanisms, we prove that KC enjoys a dimension independent variance bound with an explicit formula even in the regime when only a few bits need to be communicated per vector entry. We show how KC can be provably and efficiently combined with several existing optimization algorithms, in all cases leading to communication complexity improvements over the previous state of the art.
- 1 Introduction
- 2 Uncertainty principle for compression operators
- 3 Compression with regular polytopes
- 4 Compression with Kashin’s representation
- 5 Measure concentration and orthogonal matrices
- 6 Experiments
- 7 Conclusion and future plans
- A Proofs for Section 2
- B Proofs for Section 5
1 Introduction
In the quest for high accuracy machine learning models, both the size of the model and, consequently, the amount of data necessary to train it have increased enormously over time (Schmidhuber, 2015; Vaswani et al., 2019). Because of this, performing the learning process on a single machine is often infeasible. In a typical scenario of distributed learning, the training data (and possibly the model as well) is spread across different machines and thus the training is done in a distributed manner (Bekkerman et al., 2011; Vogels et al., 2019). Another scenario, most common in federated learning (Konečný et al., 2016; McMahan et al., 2017; Karimireddy et al., 2019a), is when training data is inherently distributed across a large number of mobile edge devices due to data privacy concerns.
1.1 Communication bottleneck
In all cases of distributed and federated learning, communication of information (e.g. the current stochastic gradient vector or the current state of the model) between computing nodes is inevitable, and it forms the primary bottleneck of such systems (Zhang et al., 2017; Lin et al., 2018). This issue is especially apparent in federated learning, where the computing nodes are devices with limited computing power and the network bandwidth is considerably lower (Li et al., 2019).
There are two general approaches to tackle this problem. One line of research, dedicated to so-called local methods, suggests doing more computational work before each communication in the hope that this increases the value of the information to be communicated (Goyal et al., 2017; Wangni et al., 2018; Stich, 2018; Khaled et al., 2020). An alternative approach investigates lossy compression strategies, which aim to send approximate but relevant information encoded with fewer bits. In this work we focus on the second approach of compressed learning. Research in this latter stream splits into two orthogonal directions. To explore savings in communication, various (mostly randomized) compression operators have been proposed and analyzed, such as random sparsification (Konečný & Richtárik, 2018; Wangni et al., 2018), Top-$k$ sparsification (Alistarh et al., 2018), standard random dithering (Goodall, 1951; Roberts, 1962; Alistarh et al., 2017), natural dithering (Horváth et al., 2019a), ternary quantization (Wen et al., 2017), and scaled sign quantization (Karimireddy et al., 2019b; Bernstein et al., 2018, 2019; Liu et al., 2019). Table 1 summarizes the most common compression methods with their variances and the number of encoding bits.
In designing a compression operator, one aims to (i) encode the compressed information with as few bits as possible, which minimizes the cost per communication round, and (ii) introduce as little noise (variance) to the communicated messages as possible, which minimizes the adverse effect of the compression on the overall iteration complexity.
|Compression Method||Unbiased?||Variance $\alpha$||Variance $\omega$||Bits (in binary32)|
|Scaled Sign Quantization||no|
|Kashin Compression (new)||yes|
1.2 Compressed learning
In order to utilize these compression methods efficiently, a lot of research has been devoted to the study of learning algorithms with compressed communication. The presence of compression in a learning algorithm affects the training process, and since a compression operator encodes the original information only approximately, it should be expected to increase the number of communication rounds. Table 2 highlights four gradient-type compressed learning algorithms with their corresponding setup and iteration complexity:
- distributed Gradient Descent (GD) with compressed gradients (Khirirat et al., 2018),
- distributed Stochastic Gradient Descent (SGD) with gradient quantization and compression variance reduction (Horváth et al., 2019b),
- distributed SGD with bi-directional gradient compression (Horváth et al., 2019a), and
- distributed SGD with gradient compression and twofold error compensation (Tang et al., 2019).
|Optimization Algorithm||Objective Function||Iteration complexity|
|Compressed GD (Khirirat et al., 2018)||smooth, strongly convex|
|DIANA (Horváth et al., 2019b)||smooth, strongly convex|
|Distributed SGD (Horváth et al., 2019a)||smooth, non-convex|
|DoubleSqueeze (Tang et al., 2019)||smooth, non-convex|
In all cases, the iteration complexity depends on the variance ($\alpha$ or $\omega$) of the underlying compression scheme and grows as more compression is applied. For this reason, we are interested in compression methods which save communication by using fewer bits and minimize iteration complexity by introducing lower variance. However, intuitively and also evidently from Table 1, these two goals are in fundamental conflict: requiring fewer bits to be communicated in each round introduces higher variance, and demanding small variance forces more bits to be communicated.
The contributions of our work are:
Uncertainty Principle. We formalize this intuitive trade-off and prove an uncertainty principle for randomized compression operators, which quantifies this limitation mathematically with the inequality
$$\alpha \cdot 4^{b/d} \ge 1, \tag{1}$$
where $\alpha \in [0,1]$ is the normalized variance / contraction factor associated with the compression operator (Definition 1), $b$ is the number of bits required to encode the compressed vector and $d$ is the dimension of the vector to be compressed. The notion of Uncertainty Principle (UP) for compression operators is introduced and theoretically proved in this paper. It is a universal property of compressed communication, completely independent of the optimization algorithm and the problem that distributed training is trying to solve. We visualize this principle in Figure 1, where we computed many possible combinations of the parameters $\alpha$ and $b$ for various compression methods. The dashed red line indicating the lower bound (1) bounds all possible combinations of all compression operators, thus validating the obtained uncertainty principle for randomized compression operators.
Kashin Compression. Motivated by this principle, we then focus on the search for the optimal compression operator. In an attempt to take a first step in this direction, we design a new unbiased compression operator inspired by the Kashin representation of vectors (Kashin, 1977), which we call Kashin Compression (KC). In contrast to all previously proposed compression methods, we prove that KC enjoys a dimension independent variance bound even in a severe compression regime when only a few bits per coordinate can be communicated. We give an explicit formula for the variance bound and show how KC can be provably and efficiently combined with several existing optimization algorithms, in all cases leading to communication complexity improvements over the previous state of the art. We believe that KC has the potential to play a role in the discovery of an optimal compression method, perhaps when composed with some other operators, such as dithering.
Experimental Validations. In our experiments, we observed the superiority of KC in terms of communication savings and stabilization property when compared against a vast array of compressors proposed in the literature. In particular, Figure 1 shows that KC combined with Top-$k$ sparsification and dithering operators yields a compression method which almost closes the gap to the UP. Kashin’s representation has been used heuristically in federated learning (Caldas et al., 2019) to mitigate the communication cost. In contrast to that work, we generate the initial tight frame of KC randomly, as suggested by the theory, and tune the parameters accordingly. Moreover, we consider combinations of KC and other compression techniques such as ternary quantization, Top-$k$ sparsification and dithering. We believe KC should be of high interest in federated and distributed learning.
2 Uncertainty principle for compression operators
In general, an uncertainty principle refers to any type of mathematical inequality expressing some fundamental trade-off between two measurements. The classical Heisenberg’s uncertainty principle in quantum mechanics (Heisenberg, 1927) shows the trade-off between the position and momentum of a particle. In harmonic analysis, the uncertainty principle limits the localization of values of a function and its Fourier transform at the same time (Havin & Jöricke, 1994). Alternatively in the context of signal processing, signals cannot be simultaneously localized in both time domain and frequency domain (Gabor, 1946). The uncertainty principle in communication deals with the quite intuitive trade-off between information compression (encoding bits) and approximation error (variance), namely more compression forces heavier distortion to communicated messages and tighter approximation requires less information compression.
In this section, we present our UP for communication compression revealing the trade-off between encoding bits of compressed information and the variance produced by compression operator. First, we describe our UP for a general class of biased compressions. Afterwards, we specialize it to the class of unbiased compressions.
2.1 UP for biased compressions
We work with the class of biased compression operators which are contractive.
Definition 1 (Biased Compressions)
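Let $\mathbb{B}(\alpha)$ be the class of biased (and possibly randomized) compression operators $\mathcal{C}\colon \mathbb{R}^d \to \mathbb{R}^d$ with the $\alpha$-contractive property, $\alpha \in [0,1]$, i.e. for any $x \in \mathbb{R}^d$
$$\mathbb{E}\left[\|\mathcal{C}(x) - x\|_2^2\right] \le \alpha \|x\|_2^2. \tag{2}$$
The parameter $\alpha$ can be seen as the normalized variance of the compression operator. Note that the compression does not need to be randomized to belong to this class. For instance, the Top-$k$ sparsification operator satisfies (2) without the expectation for $\alpha = 1 - k/d$. Next, we formalize our uncertainty principle for the class $\mathbb{B}(\alpha)$.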
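Theorem 1 (UP for biased compressions)
Let $\mathcal{C} \in \mathbb{B}(\alpha)$ be any compression operator and $b$ be the total number of bits needed to encode the compressed vector $\mathcal{C}(x)$ for any $x \in \mathbb{R}^d$. Then the following form of uncertainty principle holds
$$\alpha \cdot 4^{b/d} \ge 1. \tag{3}$$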
One can view the binary32 and binary64 floating-point formats as biased compression methods for the actual real numbers (i.e. $d = 1$), using only 32 and 64 bits respectively to represent a single number. Intuitively, these formats have limited precision (i.e. $\alpha$), and the uncertainty principle (3) shows that the precision cannot be better than $4^{-32}$ for the binary32 format and $4^{-64}$ for the binary64 format. Thus, any floating-point format representing a single number with $r$ bits has a precision constraint of $4^{-r}$, where the base $4 = 2^2$ stems from the binary nature of the bit.
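As a quick numerical illustration, the sketch below checks the contraction property of Top-$k$ sparsification and evaluates the lower bound on $\alpha$, assuming the uncertainty principle has the form $\alpha \cdot 4^{b/d} \ge 1$. The helper names `top_k` and `up_lower_bound` are hypothetical, and the bit count for Top-$k$ (32 bits per kept value plus $\lceil \log_2 d \rceil$ bits per index) is an assumed encoding, not one taken from the text.

```python
import math
import random

def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out

def up_lower_bound(bits, d):
    """Smallest alpha compatible with alpha * 4**(bits / d) >= 1."""
    return 4.0 ** (-bits / d)

random.seed(0)
d, k = 1000, 50
x = [random.gauss(0.0, 1.0) for _ in range(d)]
cx = top_k(x, k)

err = sum((a - b) ** 2 for a, b in zip(cx, x))   # ||C(x) - x||^2
norm2 = sum(a * a for a in x)                    # ||x||^2
alpha = 1 - k / d                                # contraction factor of Top-k

# Assumed encoding: 32 bits per kept value plus ceil(log2 d) bits per index.
bits = k * (32 + math.ceil(math.log2(d)))

print(err <= alpha * norm2)                      # contraction, no expectation needed
print(alpha >= up_lower_bound(bits, d))          # consistent with the lower bound
print(up_lower_bound(32, 1), up_lower_bound(64, 1))  # binary32 and binary64, d = 1
```

The last line reproduces the floating-point example: with $d = 1$ the bound reduces to $4^{-b}$.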
Furthermore, notice that compression operators can achieve zero variance in some settings, e.g. ternary or scaled sign quantization when $d = 1$ (see Table 1). On the other hand, the UP (3) implies that the normalized variance $\alpha > 0$ for any finite number of bits $b$. The reason for this inconsistency is that, for instance, the binary32 format encodes any number with 32 bits and the resulting error is usually ignored in practice. We can adjust our UP to any digital format, using $r$ bits per single number, as
$$\left(\alpha + 4^{-r}\right) 4^{b/d} \ge 1. \tag{4}$$
2.2 UP for unbiased compressions
We now specialize our UP to the class of unbiased compressions. First, we recall the definition of unbiased compression operators with a given variance.
Definition 2 (Unbiased Compressions)
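Denote by $\mathbb{U}(\omega)$ the class of unbiased compression operators $\mathcal{C}\colon \mathbb{R}^d \to \mathbb{R}^d$ with variance $\omega \ge 0$, that is, for any $x \in \mathbb{R}^d$
$$\mathbb{E}\left[\mathcal{C}(x)\right] = x, \qquad \mathbb{E}\left[\|\mathcal{C}(x) - x\|_2^2\right] \le \omega \|x\|_2^2.$$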
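To establish an uncertainty principle for $\mathbb{U}(\omega)$, we show that all unbiased compression operators with the proper scaling factor are included in $\mathbb{B}(\alpha)$.
Lemma 1
If $\mathcal{C} \in \mathbb{U}(\omega)$, then $\frac{1}{\omega+1}\mathcal{C} \in \mathbb{B}\left(\frac{\omega}{\omega+1}\right)$.
Using this inclusion, we can apply Theorem 1 to the class $\mathbb{U}(\omega)$ and derive an uncertainty principle for unbiased compression operators.
Theorem 2 (UP for unbiased compressions)
Let $\mathcal{C} \in \mathbb{U}(\omega)$ be any unbiased compression operator with variance $\omega$ and $b$ be the total number of bits needed to encode the compressed vector $\mathcal{C}(x)$ for any $x \in \mathbb{R}^d$. Then the uncertainty principle takes the form
$$\frac{\omega}{\omega+1} \cdot 4^{b/d} \ge 1.$$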
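The relation between the unbiased and the contractive class can be checked exactly on a small example. The sketch below enumerates all outcomes of random-$k$ sparsification (keep a uniformly random $k$-subset of coordinates and scale by $d/k$), for which the variance parameter $\omega = d/k - 1$ is a standard fact, and then verifies that scaling by $1/(\omega+1)$ gives a contraction with factor $\omega/(\omega+1)$. The function names are hypothetical and the test vector is arbitrary.

```python
import math
from itertools import combinations

def rand_k_outcomes(x, k):
    """All equally likely outcomes of Rand-k: keep a uniformly random
    k-subset of coordinates and scale the kept entries by d/k."""
    d = len(x)
    for subset in combinations(range(d), k):
        yield [x[i] * d / k if i in subset else 0.0 for i in range(d)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

x = [3.0, -1.0, 0.5, 2.0]
d, k = len(x), 2
outcomes = list(rand_k_outcomes(x, k))
n = len(outcomes)
norm2 = sum(a * a for a in x)

# Exact expectation and variance over the uniform choice of the subset.
mean = [sum(o[i] for o in outcomes) / n for i in range(d)]
omega = d / k - 1
var = sum(sq_dist(o, x) for o in outcomes) / n          # E||C(x) - x||^2

# Scaled operator C / (omega + 1) and its contraction factor.
alpha = omega / (omega + 1)
scaled_err = sum(sq_dist([c / (omega + 1) for c in o], x) for o in outcomes) / n

print(mean, var, omega * norm2, scaled_err, alpha * norm2)
```

For Rand-$k$ the scaled operator simply keeps the selected coordinates unscaled, so the contraction factor $\omega/(\omega+1) = 1 - k/d$ matches the one of Top-$k$.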
3 Compression with regular polytopes
Here we describe an unbiased compression scheme based on regular polytopes. With this particular compression we illustrate that it is possible for unbiased compressions to have dimension independent variance bounds and at the same time communicate a few bits per coordinate.
Let $x \in \mathbb{R}^d$, $x \ne 0$, be the vector that we need to communicate. First, we project the vector on the unit sphere
$$x = \|x\|_2 \cdot \frac{x}{\|x\|_2},$$
thus separating the magnitude $\|x\|_2$ from the direction $x / \|x\|_2$. The magnitude is a dimension independent scalar value and we can transfer it cheaply, say by 32 bits. To encode the unit vector we approximate the unit sphere by regular polytopes and then randomize over the vertices of the polytope. Polytopes can be seen as generalizations of planar polygons in high dimensions. Formally, let $P_m$ be a regular polytope with $m$ vertices $\{v_1, \dots, v_m\}$ such that it contains the unit sphere, i.e. $\mathbb{S}^{d-1} \subset P_m$, and all vertices are on the sphere of radius $R > 1$. Then, any unit vector $v$ can be expressed as a convex combination $v = \sum_{k=1}^m w_k v_k$ with some non-negative weights $w_k$ summing to one. Equivalently, $v$ can be expressed as an expectation of a random vector over $\{v_1, \dots, v_m\}$ with probabilities $w_k$. Therefore, the direction could be encoded with roughly $\log_2 m$ bits and the variance of compression will depend on the approximation, more specifically $\omega = R^2 - 1$. Kochol (2004, 1994) gave a constructive proof on approximation of the $d$ dimensional unit sphere by regular polytopes with $m$ vertices, $2d \le m \le 2^d$, for which $R = O\left(\sqrt{d / \log(m/d)}\right)$. So, choosing the number of vertices to be $m = 2^d$, we get an unbiased compression operator with $O(1)$ variance (independent of the dimension $d$) and with 1 bit per coordinate encoding.
However, this method does not seem to be practical as vertices of the polytope either need to be stored or computed each time they are used, which is infeasible for large dimensions.
4 Compression with Kashin’s representation
In this section we introduce the notion of Kashin’s representation, the algorithm of Lyubarskii & Vershynin (2010) on computing it efficiently and then describe the quantization step.
4.1 Representation systems
The most common way of compressing a given vector $x \in \mathbb{R}^d$ is to use its orthogonal representation with respect to the standard basis $(e_i)_{i=1}^d$ in $\mathbb{R}^d$:
$$x = \sum_{i=1}^d x_i e_i, \qquad x_i = \langle x, e_i \rangle.$$
However, the limitation of orthogonal expansions is that the coefficients $x_i$ are independent in the sense that if we lose one of them, we cannot recover it even approximately. Furthermore, each coefficient may carry a very different portion of the total information that the vector $x$ contains; some coefficients may carry more information than others and thus be more sensitive to compression.
For this reason, it is preferable to use tight frames and frame representations instead. Tight frames are generalizations of orthonormal bases in which the vectors of the system are not required to be linearly independent. Formally, vectors $(u_i)_{i=1}^D$ in $\mathbb{R}^d$ form a tight frame if any vector $x \in \mathbb{R}^d$ admits a frame representation
$$x = \sum_{i=1}^D \langle x, u_i \rangle u_i. \tag{7}$$
Clearly, if $D > d$ (the case we are interested in), then the system $(u_i)_{i=1}^D$ is linearly dependent and hence a representation of the form (7), $x = \sum_{i=1}^D a_i u_i$ with coefficients $a_i$, is not unique. The idea is to exploit this redundancy and choose the coefficients $a_i$ in such a way as to spread the information uniformly among them. However, the frame representation itself may not distribute the information well enough. Thus, we need a particular representation for which the coefficients $a_i$ have the smallest possible dynamic range.
For a frame $(u_i)_{i=1}^D$ define the $d \times D$ frame matrix $U$ by stacking the frame vectors as columns. It can easily be seen that being a tight frame is equivalent to the frame matrix being orthogonal, i.e. $U U^\top = I_d$, where $I_d$ is the $d \times d$ identity matrix. Using the frame matrix $U$, the frame representation (7) takes the form $x = U U^\top x$.
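A small concrete tight frame may help here. The sketch below uses three vectors in the plane at 120 degree spacing, scaled by $\sqrt{2/3}$ (often called the Mercedes-Benz frame); this choice of frame and the helper names `analysis` and `synthesis` are illustrative assumptions, not part of the construction used later. It checks that the frame coefficients reconstruct the vector and that the coefficients are not unique.

```python
import math

# Three frame vectors in R^2 at 120-degree spacing, scaled so that U U^T = I.
s = math.sqrt(2.0 / 3.0)
angles = [math.pi / 2 + 2 * math.pi * j / 3 for j in range(3)]
frame = [(s * math.cos(t), s * math.sin(t)) for t in angles]  # columns of U

def analysis(x):
    """Frame coefficients a_i = <x, u_i>, i.e. a = U^T x."""
    return [x[0] * u[0] + x[1] * u[1] for u in frame]

def synthesis(a):
    """Reconstruction U a = sum_i a_i u_i."""
    return (sum(ai * u[0] for ai, u in zip(a, frame)),
            sum(ai * u[1] for ai, u in zip(a, frame)))

x = (1.5, -0.7)
a = analysis(x)
x_rec = synthesis(a)

# Redundancy: the three frame vectors sum to zero, so shifting every
# coefficient by the same constant leaves the reconstruction unchanged.
a_shifted = [ai + 0.4 for ai in a]
x_rec_shifted = synthesis(a_shifted)

print(x_rec, x_rec_shifted)
```

Both coefficient vectors represent the same $x$, which is exactly the freedom that a representation with small dynamic range exploits.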
Definition 3 (Kashin’s representation)
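Let $(u_i)_{i=1}^D$ be a tight frame in $\mathbb{R}^d$. We call the expansion
$$x = \sum_{i=1}^D a_i u_i, \qquad \max_{1 \le i \le D} |a_i| \le \frac{K}{\sqrt{D}} \|x\|_2 \tag{8}$$
Kashin’s representation of $x \in \mathbb{R}^d$ with level $K$.
Existence. It turns out that not every tight frame can guarantee Kashin’s representation with constant level. The following existence result is based on Kashin’s theorem (Kashin, 1977):
Theorem 3
There exist tight frames in $\mathbb{R}^d$ with arbitrarily small redundancy $\lambda = D/d > 1$ such that every vector $x \in \mathbb{R}^d$ admits Kashin’s representation with level $K = K(\lambda)$ that depends on $\lambda$ only (not on $d$ or $D$).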
4.2 Computing Kashin’s representation
To compute Kashin’s representation we use the algorithm developed by Lyubarskii & Vershynin (2010), which transforms the frame representation (7) into Kashin’s representation (8). The algorithm requires a tight frame whose frame matrix satisfies the restricted isometry property:
Definition 4 (Restricted Isometry Property (RIP))
A given matrix $U \in \mathbb{R}^{d \times D}$ satisfies the Restricted Isometry Property with parameters $\delta, \eta \in (0,1)$ if for any $x \in \mathbb{R}^D$
$$|\operatorname{supp}(x)| \le \delta D \quad \Longrightarrow \quad \|Ux\|_2 \le \eta \|x\|_2. \tag{9}$$
In general, for an orthogonal $d \times D$ matrix $U$ we can only guarantee the inequality $\|Ux\|_2 \le \|x\|_2$ for arbitrary $x \in \mathbb{R}^D$. The RIP requires $U$ to be a contraction mapping for sparse $x$. With a frame matrix satisfying RIP, the analysis of Algorithm 1 from (Lyubarskii & Vershynin, 2010) yields a formula for the level of Kashin’s representation:
Theorem 4
Let $(u_i)_{i=1}^D$ be a tight frame in $\mathbb{R}^d$ which satisfies RIP with parameters $\delta, \eta$. Then any vector $x \in \mathbb{R}^d$ admits a Kashin’s representation with level
$$K = \frac{1}{(1 - \eta)\sqrt{\delta}}. \tag{10}$$
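The algorithm itself is not reproduced in the text, so the following is only a minimal sketch of an iterative truncation scheme in its spirit: compute the frame coefficients of the current residual, clip them at a level that shrinks geometrically by $\eta$, accumulate the clipped coefficients and repeat. The toy frame $U = [I \mid H]/\sqrt{2}$ with a $4 \times 4$ normalized Hadamard matrix $H$, the parameter values $\delta = 3/8$ and $\eta = 0.93$, the initial level $\|x\|_2/\sqrt{\delta D}$, and all function names are assumptions made for illustration.

```python
import math

d, D = 4, 8
h = 0.5
H = [[h, h, h, h],
     [h, -h, h, -h],
     [h, h, -h, -h],
     [h, -h, -h, h]]            # 4 x 4 orthogonal (normalized Hadamard) matrix
r = 1.0 / math.sqrt(2.0)
# Columns of the frame matrix U = [I | H] / sqrt(2); then U U^T = I.
cols = [[r if i == j else 0.0 for i in range(d)] for j in range(d)]
cols += [[r * H[i][j] for i in range(d)] for j in range(d)]

def analysis(x):
    """Frame coefficients U^T x."""
    return [sum(xi * ui for xi, ui in zip(x, u)) for u in cols]

def synthesis(a):
    """Linear combination U a."""
    return [sum(a[j] * cols[j][i] for j in range(D)) for i in range(d)]

def norm(v):
    return math.sqrt(sum(t * t for t in v))

def kashin_coefficients(x, delta, eta, iters=400):
    """Accumulate clipped frame coefficients of the residual; returns (a, residual)."""
    a = [0.0] * D
    res = list(x)
    level = norm(x) / math.sqrt(delta * D)   # assumed initial truncation level
    for _ in range(iters):
        b = analysis(res)
        b_trunc = [max(-level, min(level, t)) for t in b]
        a = [ai + bi for ai, bi in zip(a, b_trunc)]
        res = [ri - si for ri, si in zip(res, synthesis(b_trunc))]
        level *= eta                          # shrink the level geometrically
    return a, res

delta, eta = 3.0 / 8.0, 0.93    # assumed RIP parameters for this toy frame
K = 1.0 / ((1.0 - eta) * math.sqrt(delta))

x = [1.0, 0.0, 0.0, 0.0]        # a "spiky" vector
a, res = kashin_coefficients(x, delta, eta)
plain = analysis(x)             # ordinary frame coefficients, for comparison

print(max(abs(t) for t in plain), max(abs(t) for t in a), K / math.sqrt(D))
```

On this spiky input the accumulated coefficients still reconstruct $x$ while their largest magnitude is smaller than that of the plain frame coefficients; the guaranteed level $K/\sqrt{D}$ is loose for such a tiny frame.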
4.3 Quantizing Kashin’s representation
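We utilize Kashin’s representation to design a compression method which enjoys a dimension-free variance bound on the approximation error. Let $x \in \mathbb{R}^d$ be the vector that we want to communicate and $\lambda > 1$ be the redundancy factor such that $D = \lambda d$ is a positive integer. First we find Kashin’s representation of $x$, i.e. $x = Ua$ for some $a \in \mathbb{R}^D$, and then quantize the coefficients $a$ using any unbiased compression operator $\mathcal{C}\colon \mathbb{R}^D \to \mathbb{R}^D$ that preserves the sign and maximum magnitude:
$$0 \le \mathcal{C}(a)_i \operatorname{sign}(a_i) \le \|a\|_\infty, \qquad i = 1, \dots, D. \tag{11}$$
For example, ternary quantization or any dithering (standard random, natural) can be applied. The vector that we communicate is the quantized coefficients $\mathcal{C}(a)$ and KC is defined via
$$\mathcal{C}_\kappa(x) = U \mathcal{C}(a).$$
Due to unbiasedness of $\mathcal{C}$ and linearity of expectation, we preserve unbiasedness for $\mathcal{C}_\kappa$:
$$\mathbb{E}\left[\mathcal{C}_\kappa(x)\right] = U\, \mathbb{E}\left[\mathcal{C}(a)\right] = Ua = x.$$
Then we bound the error of approximation uniformly (without the expectation) as follows
$$\|\mathcal{C}_\kappa(x) - x\|_2^2 = \|U(\mathcal{C}(a) - a)\|_2^2 \le \|\mathcal{C}(a) - a\|_2^2 \le D \|\mathcal{C}(a) - a\|_\infty^2 \le D \|a\|_\infty^2 \le K^2 \|x\|_2^2,$$
where $K = K(\lambda)$ is the level of Kashin’s representation (8).
The obtained uniform upper bound does not depend on the dimension $d$. It depends only on the redundancy factor $\lambda$, which should be chosen depending on how much less we want to communicate. Thus, KC with any unbiased quantization (11) belongs to $\mathbb{U}(K^2)$. Note that we are not restricted to unbiased compressions with Kashin’s representation. For instance, instead of random sparsification (which is unbiased and satisfies (11)) one can use Top-$k$ sparsification, which satisfies (11) and in practice works much better despite having similar theoretical properties.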
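The two properties used in this argument, unbiasedness and the uniform error bound in terms of the largest coefficient, can be verified exactly on a toy instance. The sketch below is an illustration under stated assumptions: it uses a three-vector tight frame in the plane and plain frame coefficients as a stand-in for a true Kashin representation, and it applies ternary quantization in the form "entry $i$ becomes $\operatorname{sign}(a_i)\|a\|_\infty$ with probability $|a_i|/\|a\|_\infty$, else $0$". All names are hypothetical.

```python
import math
from itertools import product

# Toy tight frame in R^2 (three vectors at 120 degrees, scaled so U U^T = I).
s = math.sqrt(2.0 / 3.0)
frame = [(s * math.cos(t), s * math.sin(t))
         for t in (math.pi / 2 + 2 * math.pi * j / 3 for j in range(3))]
D = len(frame)

def analysis(x):
    return [x[0] * u[0] + x[1] * u[1] for u in frame]

def synthesis(a):
    return (sum(ai * u[0] for ai, u in zip(a, frame)),
            sum(ai * u[1] for ai, u in zip(a, frame)))

def ternary_outcomes(a):
    """Yield (probability, quantized vector) over all outcomes of ternary quantization."""
    m = max(abs(t) for t in a)
    probs = [abs(t) / m for t in a]
    for mask in product((0, 1), repeat=len(a)):
        p = 1.0
        for keep, q in zip(mask, probs):
            p *= q if keep else 1.0 - q
        yield p, [math.copysign(m, t) if keep else 0.0 for keep, t in zip(mask, a)]

x = (1.5, -0.7)
a = analysis(x)                 # stand-in for Kashin coefficients
m = max(abs(t) for t in a)

mean = [0.0, 0.0]
worst_err2 = 0.0
for p, q in ternary_outcomes(a):
    y = synthesis(q)            # compressed vector U C(a)
    mean = [mi + p * yi for mi, yi in zip(mean, y)]
    if p > 0:
        worst_err2 = max(worst_err2, (y[0] - x[0]) ** 2 + (y[1] - x[1]) ** 2)

print(mean, worst_err2, D * m * m)
```

The exact mean equals $x$, and the worst-case squared error never exceeds $D\|a\|_\infty^2$, which is the step where a small dynamic range of the coefficients pays off.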
5 Measure concentration and orthogonal matrices
The concentration of the measure is a remarkable high-dimensional phenomenon which roughly claims that a function defined on a high-dimensional space and having small oscillations takes values highly concentrated around the average (Ledoux, 2001; Giannopoulos & Milman, 2000). Here we present one example of such concentration for Lipschitz functions on the unit sphere, which will be the key to justify the restricted isometry property.
5.1 Concentration on the sphere for Lipschitz functions
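Let $\mathbb{S}^{d-1} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ be the unit sphere. We say that $f\colon \mathbb{S}^{d-1} \to \mathbb{R}$ is a Lipschitz function with constant $L > 0$ if
$$|f(x) - f(y)| \le L \|x - y\|_2$$
for any $x, y \in \mathbb{S}^{d-1}$.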
Let be a random vector uniformly distributed on the unit Euclidean sphere. If is Lipschitz function, then for any
Informally and rather surprisingly, Lipschitz functions on a high-dimensional unit sphere are almost constants. Particularly, it implies that deviations of function values from the average are at most with confidence level more than 0.99. We will apply this concentration inequality for the function which is Lipschitz if is orthogonal.
5.2 Random orthogonal matrices
Up to this point we did not discuss how to choose the frame vectors $u_i$ or the frame matrix $U$, which is used in the construction of Kashin’s representation. We only know that it should be orthogonal and satisfy RIP for some parameters $\delta, \eta$. We now describe how to construct the frame matrix $U$ and how to estimate the parameters $\delta, \eta$. Unfortunately, there is no explicit construction scheme for such matrices. There are, however, random generation processes that provide probabilistic guarantees (Candès & Tao, 2005, 2006; Lyubarskii & Vershynin, 2010).
Consider random $d \times D$ matrices with orthonormal rows. Such matrices are obtained by selecting the first $d$ rows of orthogonal $D \times D$ matrices. Let $O(D)$ be the space of all orthogonal $D \times D$ matrices with the unique translation invariant and normalized measure, which is called the Haar measure for that space. Then the space of $d \times D$ orthogonal matrices is
$$O(d \times D) = \{U = P_d V : V \in O(D)\},$$
where $P_d\colon \mathbb{R}^D \to \mathbb{R}^d$ is the orthogonal projection on the first $d$ coordinates. The probability measure on $O(d \times D)$ is induced by the Haar measure on $O(D)$. Next we show that, with respect to the normalized Haar measure, randomly generated orthogonal matrices satisfy RIP with high probability.
Let and , then with probability at least
a random orthogonal matrix satisfies RIP with parameters
Note that the expression for the probability can be negative if is too close to 1. Specifically, the logarithmic term vanishes for giving negative probability. However, the probability approaches to quite rapidly for bigger ’s. To get a sense of how high that probability can be, note that for variables and inflation it is bigger than .
Now that we have explicit formulas for the parameters and , we can combine it with the results of Section 4 and summarize with the following theorem.
Let be the redundancy factor and be any unbiased compression operator satisfying (11). Then Kashin Compression is an unbiased compression with dimension independent variance
6 Experiments
In this section we describe the implementation details of KC and present our experiments of KC compared to other popular compression methods in the literature.
6.1 Implementation details of KC
To generate a random (fat) orthogonal frame matrix, we first generate a random matrix with entries drawn independently from a Gaussian distribution. Then we extract an orthogonal matrix by applying QR decomposition. Note that for large dimensions the generation of the frame matrix becomes computationally expensive. However, once the dimension of the to-be-compressed vectors is fixed, the frame matrix needs to be generated only once and can be reused throughout the learning process.
Afterwards, we turn to the estimation of the RIP parameters $\delta$ and $\eta$, which are necessary to compute Kashin’s representations. These parameters are estimated iteratively so as to minimize the representation level (10) subject to the RIP constraint (9). For fixed $\delta$ we first find the least $\eta$ such that (9) holds for unit vectors, which were obtained by normalizing Gaussian random vectors (we chose sample size of , which provided a good estimate). Then we tune the parameter $\delta$ (initially chosen ) to minimize the level (10).
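The frame generation step can be sketched without external libraries. The code below orthonormalizes i.i.d. Gaussian rows with Gram-Schmidt, which is used here as a plain substitute for a library QR routine (it produces the orthonormal factor of the transposed Gaussian matrix); the function name `random_frame_matrix`, the seed, and the small dimensions are illustrative assumptions.

```python
import math
import random

def random_frame_matrix(d, D, seed=0):
    """Return a d x D matrix (list of rows) with orthonormal rows,
    obtained by Gram-Schmidt on i.i.d. standard Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(D)]
        for _ in range(2):               # second pass improves numerical orthogonality
            for q in rows:
                c = sum(a * b for a, b in zip(v, q))
                v = [a - c * b for a, b in zip(v, q)]
        n = math.sqrt(sum(a * a for a in v))
        rows.append([a / n for a in v])
    return rows

d, lam = 6, 2                            # dimension and redundancy factor
U = random_frame_matrix(d, lam * d)

# Gram matrix U U^T, which should be the d x d identity.
gram = [[sum(a * b for a, b in zip(U[i], U[j])) for j in range(d)] for i in range(d)]
max_dev = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
              for i in range(d) for j in range(d))
print(max_dev)
```

Since the matrix depends only on the dimensions and the seed, it can be generated once and shared by all nodes, in line with the remark that the frame is reused throughout training.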
6.2 Empirical variance comparison
We empirically compare the variance produced by natural dithering against KC with natural dithering and observe that the latter introduces much less variance. We generated vectors with independent entries from the standard Gaussian distribution. Then we fix the minimum number of levels that allows obtaining an acceptable variance for performing KC with natural dithering. Next, we adjust the number of levels for natural dithering so that almost the same number of bits is used for transmission of the compressed vector. For each of these vectors we compute normalized empirical variance via
In Figure 2 we provide boxplots for empirical variances, which show that the increase of parameter leads to smaller variance for KC. They also confirm that for natural dithering, the variance scales with the dimension while for KC that scaling is significantly reduced (see also Table 1 for variance bounds). This shows the positive effect of KC combined with other compression methods. For additional insights, we present also swarmplots provided by Seaborn Library. Figure 3 illustrates the strong robustness property of KC with respect to outliers.
6.3 Minimizing quadratics with CGD
To illustrate the advantages of KC in optimization algorithms, we minimized randomly generated quadratic functions (15) for using gradient descent with compressed gradients.
In Figure 3(a) we evaluate functional suboptimality
in log-scale for vertical axis. These plots illustrate the superiority of KC with ternary quantization, where it does not degrade the convergence at all and saves in communication compared to other compression methods and without any compression scheme.
To provide more insights into this setting, Figure 3(b) visualizes empirical variances of the compressed gradients throughout the optimization process, revealing both the low variance feature and the stabilization property of KC.
6.4 Minimizing quadratics with distributed CGD
Consider the minimization problem of the average of quadratics
with synthetically generated matrices . We solve this problem with Distributed Compressed Gradient Descent (Algorithm 2) using a selection of compression operators.
Figures 5 and 6 show that KC combined with ternary quantization leads to faster convergence and uses fewer bits to communicate than ternary quantization alone. Note that in higher dimensions the gap between KC with ternary quantization and no compression gets smaller in the iteration plot, while in the communication plot it gets bigger. So, in high dimensions KC converges slightly slower than the scheme without compression, but the savings in communication are huge.
7 Conclusion and future plans
We formalized, for the first time, the limitation of (randomized) compression operators in communication and mathematically proved an uncertainty principle for communication compression. We also presented a new and highly robust compressor, the Kashin compressor (KC), and showed that in combination with some other compression methods it gives almost optimal compression, thus nearly closing the gap established by our uncertainty principle. As future work, we plan to implement a sparse and efficient generation of large-size random orthogonal matrices using block structured small-size orthogonal matrices. This should reduce both the storage requirement and the computational effort needed to use KC in practical applications.
Appendix A Proofs for Section 2
A.1 Proof of Theorem 1: UP for biased compressions
Fix and let be the -dimensional Euclidean closed ball with center at the origin and with radius . Denote by the number of possible outcomes of compression operator and by the set of compressed vectors. We relax the -contractive requirement and prove (3) in the case when the restricted compression operator satisfies
Define probability functions as follows
Then we stack functions together and get a vector valued function , where is the standard -simplex
We can express the expectation in (17) as
and taking into account the inequality (17) itself, we conclude
The above inequality holds for the particular probability function defined from the compression . Therefore the inequality will remain valid if we take the minimum of left hand side over all possible probability functions :
We then swap the order of min-max by adjusting domains properly:
where the second minimum is over all probability vectors (not over vector valued functions as in the first minimum). Next, notice that
where is the closest to . Therefore, we have transformed (19) into
The last inequality means that the set is an -net for the ball . Using the following result on covering numbers and volume (see Proposition 4.2.12, (Vershynin, 2018)) we conclude
which completes the proof since
A.2 Proof of Lemma 1
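Let $\mathcal{C} \in \mathbb{U}(\omega)$. Using the relations $\mathbb{E}\left[\mathcal{C}(x)\right] = x$ and $\mathbb{E}\|\mathcal{C}(x)\|_2^2 \le (\omega + 1)\|x\|_2^2$, we get
$$\mathbb{E}\left\|\frac{1}{\omega+1}\mathcal{C}(x) - x\right\|_2^2 = \frac{1}{(\omega+1)^2}\mathbb{E}\|\mathcal{C}(x)\|_2^2 - \frac{2}{\omega+1}\|x\|_2^2 + \|x\|_2^2 \le \left(\frac{1}{\omega+1} - \frac{2}{\omega+1} + 1\right)\|x\|_2^2 = \frac{\omega}{\omega+1}\|x\|_2^2,$$
which concludes the lemma.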
Appendix B Proofs for Section 5
B.1 Proof of Theorem 5: Concentration on the sphere for Lipschitz functions
Let be the unit sphere with the normalized Lebesgue measure and the geodesic metric representing the angle between and . Using this metric, we define the spherical caps as the balls in :
For a set and non-negative number denote by the -neighborhood of with respect to geodesic metric:
The famous result of P. Levy on isoperimetric inequality for the sphere states that among all subsets of a given measure, the spherical cap has the smallest measure for the neighborhood (see e.g. (Ledoux, 2001)).
Theorem 8 (Levy’s isoperimetric inequality)
Let be a closed set and let . If is a spherical cap with , then
We also need the following upper bound on the measure of spherical caps
Let . If is a spherical cap with radius , then
These two results yield a concentration inequality on the unit sphere around median of the Lipschitz function.
Let be a -Lipschitz function (w.r.t. geodesic metric)
Then, for any
Proof of Theorem 9: Concentration around the median Without loss of generality we can assume that . Denote
Note that implies that for some . Using the Lipschitzness of we get . Analogously, implies that for some . Again, the Lipschitzness of gives . Thus
To complete the proof, it remains to combine this with inequalities for measures of complements
Continuity of and give the result with the relaxed inequality.
Again, without loss of generality we assume that and . Fix and decompose the set into two parts:
where is a median of . From the concentration (21) around the median, we get an estimate for
Now we want to estimate the second term with a similar upper bound so to combine them. Obviously, the condition in does not depend on , and it is a piecewise constant function of . Therefore
where we bounded as follows
We further upper bound to get the same exponential term as for :
To check the validity of the latter upper bound, first notice that for both are equal to 1. Then, the monotonicity and positiveness of the exponential function imply (22) for and . Combining these two upper bounds for and , we get
if we set . To conclude the theorem, note that normalized uniform measure on the unit sphere can be seen as a probability measure on .
B.2 Proof of Theorem 6: Random orthogonal matrices with RIP
Let be fixed. Any orthogonal matrix can be represented as the projection of orthogonal matrix . The uniform probability measure (or Haar measure) on ensures that if is random then the vector is uniformly distributed on . Therefore, if is random with respect to the induced Haar measure on , then random vectors and have identical distributions. Denote and notice that is -Lipschitz on the sphere . To apply the concentration inequality (23), we compute the expected norm of these random vectors:
where we used the fact that coordinates are distributed identically and therefore they have the same mean. Applying inequality (23) yields, for any
Let be the set of vectors with at most non-zero elements
where denotes the subset of vectors