# Coded Distributed Computing: Performance Limits and Code Designs

Mohammad Vahid Jamali, Mahdi Soleymani, and Hessam Mahdavifar
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA
E-mails: {mvjamali, mahdy, hessam}@umich.edu
###### Abstract

We consider the problem of coded distributed computing where a large linear computational job, such as a matrix multiplication, is divided into $k$ smaller tasks, encoded using an $(n,k)$ linear code, and performed over $n$ distributed nodes. The goal is to reduce the average execution time of the computational job. We provide a connection between the problem of characterizing the average execution time of a coded distributed computing system and the problem of analyzing the error probability of codes of length $n$ used over erasure channels. Accordingly, we present closed-form expressions for the execution time using binary random linear codes and the best execution time any linear-coded distributed computing system can achieve. It is also shown that there exist good binary linear codes that attain, asymptotically, the best performance any linear code, not necessarily binary, can achieve. We also investigate the performance of coded distributed computing systems using polar and Reed-Muller (RM) codes that can benefit from low-complexity decoding and superior performance, respectively, as well as explicit constructions. The framework proposed in this paper can enable efficient designs of distributed computing systems given the rich literature in channel coding theory.

## I Introduction

There has been increasing interest in recent years toward applying ideas from coding theory to improve the performance of various computation, communication, and networking applications. For example, ideas from repetition coding have been applied to several setups in computer networks, e.g., by running a request over multiple servers and waiting for the first completion of the request by discarding the rest of the request duplicates [ananthanarayanan2013effective, vulimiri2013low, gardner2015reducing]. Another direction is to investigate the application of coding theory in cloud networks and distributed computing systems [jonas2017occupy, lee2018speeding]. A rule of thumb is that when the computational job consists of linear operations, coding techniques can be applied to improve the run-time performance of the system under consideration.

Distributed computing refers to the problem of performing a large computational job over many, say $n$, nodes with limited processing capabilities. A coded computing scheme aims to divide the job into $k$ tasks and then to introduce redundant tasks using an $(n,k)$ code, in order to alleviate the effect of slower nodes, also referred to as stragglers. In such a setup, it is assumed that each node is assigned one task and hence, the total number of encoded tasks is equal to the number of nodes $n$.

Recently, there have been extensive research activities to leverage coding schemes in order to boost the performance of distributed computing systems [lee2018speeding, li2016unified, reisizadeh2017coded, li2016coded, li2017coding, yang2017computing, lee2017high, dutta2016short, vulimiri2013low, wang2018coded, mallick2018rateless]. For example, [lee2018speeding] has applied coding theory to combat the deteriorating effects of stragglers in matrix multiplication and data shuffling. The authors in [reisizadeh2017coded] considered coded distributed computing in heterogeneous clusters consisting of servers with different computational capabilities.

Most of the work in the literature focuses on the application of maximum distance separable (MDS) codes. However, encoding and decoding of MDS codes over real numbers, especially when the number of servers is large, e.g., more than , face several barriers, such as numerical stability and decoding complexity. In particular, decoding of MDS codes is not robust against unavoidable rounding errors when used over real numbers [higham2002accuracy]. Employing large finite fields, e.g., coded matrix multiplication using polynomial codes in [yu2017polynomial], can be an alternative approach. However, applying large finite fields imposes further numerical barriers due to quantization when used over real-valued data.

As we will show in Section III, MDS codes are theoretically optimal in terms of minimizing the average execution time of any linear-coded distributed computing system. However, as discussed above, their application comes with some practical impediments, either when used over real-valued inputs or large finite fields, in most distributed computing applications comprising a large number of local nodes. A sub-optimal yet practically interesting approach is to apply binary linear codes, with generator matrices consisting of $0$'s and $1$'s, and then perform the computation over real values. In this case, there is no need for quantization, as a zero in the $(i,j)$-th element of the generator matrix of the binary linear code means that the $i$-th task is not included in the $j$-th encoded task sent to the $j$-th node, while a one means it is included. To this end, in this paper, we consider binary linear codes where all computations are performed over real-valued data inputs. A related work to this model is the very recent work in [bartan2019polar], where binary polar codes are applied for distributed matrix multiplication. The authors in [bartan2019polar] justify the application of binary codes over real-valued data and provide a decoding algorithm using the polar decoder.

In this work, we connect the problem of characterizing the average execution time of any coded distributed computing system to the error probability of the underlying coding scheme over $n$ uses of erasure channels (see Lemma 1). Using this connection, we characterize the performance limits of distributed computing systems such as the average execution time that any linear code can achieve (see Theorem 2), the average job completion time using binary random linear codes (see Corollary 4), and the best achievable average execution time of any linear code (see Corollary 5) that can, provably, be attained using MDS codes requiring operations over large finite fields. Moreover, we study the gap between the average execution time of binary random linear codes and the optimal performance (see Theorem 7), showing that the normalized gap approaches zero as $n \to \infty$ (see Corollary 8). This implies that there exist binary linear codes that attain, asymptotically, the best performance any linear code, not necessarily binary, can achieve. We further study the performance of coded distributed computing systems using polar and Reed-Muller (RM) codes that can benefit from low-complexity decoding and superior performance, respectively.

## II System Model

We consider a distributed computing system consisting of $n$ local nodes with the same computational capabilities. The run time $T_i$ of each local node $i$ is modeled using a shifted-exponential random variable (RV), a model widely adopted in the literature [liang2014tofec, reisizadeh2017coded, lee2018speeding]. Then, when the computational job is equally divided into $k$ tasks, the cumulative distribution function (CDF) of $T_i$ is given by

$$\Pr(T_i \leqslant t) = 1 - \exp(-\mu(kt-1)), \quad \forall\, t \geqslant 1/k, \tag{1}$$

where $\mu$ is the exponential rate of each local node, also called the straggling parameter. Using (1) one can observe that the probability of the task assigned to the $i$-th server not being completed (equivalent to erasure) until time $t \geqslant 1/k$ is

$$\epsilon(t) \triangleq \Pr(T_i > t) = \exp(-\mu(kt-1)), \tag{2}$$

and is one for $t < 1/k$. Therefore, given any time $t$, the problem of computing $k$ parts of the computational job over $n$ servers can be interpreted as the traditional problem of transmitting $k$ symbols, using an $(n,k)$ code, over $n$ independent-and-identically-distributed (i.i.d.) erasure channels with erasure probability $\epsilon(t)$. Note that the form of the CDF in (1) suggests that $1/k$ is the (normalized) deterministic time required for each server to process its assigned portion of the total job (all tasks are erased before $t = 1/k$), while any time elapsed after $1/k$ refers to the stochastic time as a result of the servers' statistical behavior (tasks are not completed with probability $\epsilon(t)$ for $t \geqslant 1/k$).
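The run-time model in (1) and the erasure interpretation in (2) can be sketched numerically as follows. This is a minimal illustration rather than part of the original analysis; the function names and the Monte-Carlo check are assumptions introduced here.

```python
import math
import random


def erasure_probability(t, k, mu):
    """epsilon(t) from (2): probability that a task is still unfinished at time t."""
    if t < 1.0 / k:
        return 1.0  # no task can finish before the deterministic time 1/k
    return math.exp(-mu * (k * t - 1.0))


def sample_run_time(k, mu, rng):
    """Draw one run time with CDF (1): a shift of 1/k plus an Exp(mu * k) variable."""
    return 1.0 / k + rng.expovariate(mu * k)


def empirical_erasure(t, k, mu, trials=20000, seed=0):
    """Fraction of simulated nodes that have not finished by time t."""
    rng = random.Random(seed)
    unfinished = sum(sample_run_time(k, mu, rng) > t for _ in range(trials))
    return unfinished / trials
```

For a fixed deadline, the empirical fraction of unfinished nodes should be close to `erasure_probability`, which is exactly the erasure probability seen by the code at that time.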

Given a certain code and a corresponding decoder over erasure channels, a decodable set of tasks refers to a pattern of unerased symbols resulting in successful decoding with probability $1$. Then, $P_e(\epsilon, n)$ is defined as the probability of decoding failure over an erasure channel with erasure probability $\epsilon$. For instance, for a code. Note that the reason to keep $n$ in the notation is to specify that the number of servers, when the code is used in distributed computation, is also $n$. Finally, the total job completion time $T$ is defined as the time at which a decodable set of tasks/outputs is obtained from the servers.

## III Fundamental Limits

The following lemma connects the average execution time of any linear-coded distributed computing system to the error probability of the underlying coding scheme over $n$ uses of an erasure channel.

###### Lemma 1.

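The average execution time of a linear-coded distributed computing system using a given $(n,k)$ code can be characterized as

$$
\begin{aligned}
T_{\mathrm{avg}} \triangleq E[T] &= \int_0^\infty P_e(\epsilon(\tau), n)\, d\tau && (3)\\
&= \frac{1}{k} + \frac{1}{\mu k}\int_0^1 \frac{P_e(\epsilon, n)}{\epsilon}\, d\epsilon, && (4)
\end{aligned}
$$

where $\epsilon(\tau)$ is defined in (2).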

###### Proof.

It is well-known that the expected value of any non-negative RV $T$ is related to its CDF $F_T(\tau)$ as $E[T] = \int_0^\infty (1 - F_T(\tau))\, d\tau$. Note that $1 - F_T(\tau) = \Pr(T > \tau)$ is the probability of the event that the job is not completed until some time $\tau$. Therefore, using the system model in Section II, we can interpret $\Pr(T > \tau)$ as the probability of decoding failure of the code when used over $n$ i.i.d. erasure channels with the erasure probability $\epsilon(\tau)$. This completes the proof of (3). Now given that $d\epsilon(\tau)/d\tau = -\mu k\, \epsilon(\tau)$ for the shifted-exponential distribution when $\tau > 1/k$, and that $\epsilon(\tau) = 1$ (so that $P_e(\epsilon(\tau), n) = 1$) for all $\tau \leqslant 1/k$, we have (4) by the change of variables. ∎

Remark 1. Note that (3) holds given any model for the distribution of the run time of the servers, while (4) is obtained under the shifted-exponential distribution, with all servers having the same straggling parameter $\mu$, and can be extended to other distributions in a similar way.
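As a sanity check of Lemma 1, the integral in (4) can be evaluated numerically for any code whose failure probability is known. The sketch below is illustrative only: it uses a plain midpoint rule and takes an MDS code as the example, for which decoding fails exactly when more than $n-k$ symbols are erased, and compares the result against the harmonic-sum closed form stated later in (12). All function names are assumptions.

```python
import math


def mds_failure_probability(eps, n, k):
    """P_e(eps, n) of an (n, k) MDS code: failure iff more than n - k erasures occur."""
    return sum(math.comb(n, i) * eps**i * (1 - eps) ** (n - i)
               for i in range(n - k + 1, n + 1))


def average_execution_time(failure_probability, n, k, mu, steps=20000):
    """Evaluate (4) with the midpoint rule; failure_probability plays the role of P_e."""
    h = 1.0 / steps
    integral = 0.0
    for s in range(steps):
        eps = (s + 0.5) * h  # midpoints avoid evaluating at eps = 0
        integral += failure_probability(eps, n, k) / eps * h
    return 1.0 / k + integral / (mu * k)


def mds_closed_form(n, k, mu):
    """1/k + (1/(mu k)) * sum_{i=n-k+1}^{n} 1/i, the MDS value of the average time."""
    return 1.0 / k + sum(1.0 / i for i in range(n - k + 1, n + 1)) / (mu * k)
```

Any other code can be plugged in by passing a different `failure_probability` callable, e.g., one estimated by simulation.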

###### Theorem 2.

The average execution time of any linear-coded distributed computing system can be expressed as

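$$T_{\mathrm{avg}} = \frac{1}{k}\left[1 + \sum_{i=n-k+1}^{n} \frac{1}{i\mu}\right] + \frac{1}{\mu k}\sum_{i=1}^{n-k} \frac{1}{i}\, p(i,k), \tag{5}$$

where $p(i,k)$ is the average conditional probability of decoding failure, for an underlying decoder, given that $i$ encoded symbols are erased at random.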

###### Proof.

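Using the law of total probability and the definition of $p(i,k)$ we have

$$P_e(\epsilon, n) = \sum_{i=1}^{n} \binom{n}{i} \epsilon^{i} (1-\epsilon)^{n-i}\, p(i,k). \tag{6}$$

Accordingly, characterizing $T_{\mathrm{avg}}$ requires computing integrals of the form $f(i,n) \triangleq \int_0^1 \epsilon^{i-1}(1-\epsilon)^{n-i}\, d\epsilon$ for $i = 1, \dots, n$. Using integration by parts one can find the recursive relation $f(i,n) = \frac{n-i}{i}\, f(i+1,n)$, with $f(n,n) = 1/n$, which results in $f(i,n) = \left[i\binom{n}{i}\right]^{-1}$. Note that $p(i,k) = 1$ for $i > n-k$, since one cannot extract the $k$ parts of the original job from fewer than $k$ encoded symbols. Then plugging (6) into (4) leads to (5). ∎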
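The integral identity behind Theorem 2 and the resulting expression (5) can be checked with exact rational arithmetic. The following is a small sketch under the assumption that the failure profile $p(i,k)$ is supplied as a mapping from the number of erasures to a probability; the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb


def beta_integral(i, n):
    """Exact integral of eps^(i-1) * (1-eps)^(n-i) over [0, 1], via binomial expansion."""
    return sum(Fraction((-1) ** j * comb(n - i, j), i + j) for j in range(n - i + 1))


def average_time_from_profile(p, n, k, mu):
    """Evaluate (5); p[i] is the failure probability given i erasures, for i = 1..n-k."""
    mu = Fraction(mu)
    fixed = (1 + sum(Fraction(1, i) for i in range(n - k + 1, n + 1)) / mu) / k
    code_dependent = sum(Fraction(p[i]) / i for i in range(1, n - k + 1)) / (mu * k)
    return fixed + code_dependent
```

Multiplying `beta_integral(i, n)` by $\binom{n}{i}$ gives $1/i$, which is where the harmonic-type sums in (5) come from.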

Next, we characterize the average execution time using a random ensemble of binary linear codes with full-rank generator matrices. This random ensemble is obtained by picking the entries of the $k \times n$ generator matrix independently and uniformly at random, followed by removing those matrices not having full row rank from the ensemble.

Remark 2. Note that (6) together with the integral form in (4) suggests that a coded computing system should always encode with a full-rank generator matrix; otherwise, the average execution time does not converge (a rank-deficient generator matrix fails with nonzero probability even when no symbol is erased, so the integrand in (4) is not integrable near $\epsilon = 0$). This is the reason behind picking the particular ensemble described above. Note that this is in contrast with conventional block coding, where we can get an arbitrarily small average probability of error over a random ensemble of all binary generator matrices.

###### Lemma 3.

The probability that the generator matrix of a code picked from this ensemble does not remain full row rank after erasing $i$ columns uniformly at random, denoted by $p_f(i,k)$, can be expressed as

$$p_f(i,k) = 1 - \frac{\prod_{j=1}^{k}\left(1 - 2^{j-1-n+i}\right)}{\prod_{l=1}^{k}\left(1 - 2^{l-1-n}\right)}. \tag{7}$$

###### Proof.

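Define $l(m,k)$ as the probability of $k$ binary uniform random vectors of length $m$ being linearly independent. It is well-known that

$$l(m,k) = \prod_{i=1}^{k}\left(1 - 2^{i-1-m}\right). \tag{8}$$

Let $\tilde{G}$ denote the $k \times (n-i)$ matrix obtained after removing $i$ columns of the generator matrix $G$ uniformly at random. Then

$$
\begin{aligned}
p_f(i,k) &= 1 - \Pr\left(\{\mathrm{rank}(\tilde{G}) = k\} \,\middle|\, \{\mathrm{rank}(G) = k\}\right) && (9)\\
&= 1 - \frac{\Pr(\{\mathrm{rank}(\tilde{G}) = k\})}{\Pr(\{\mathrm{rank}(G) = k\})} && (10)\\
&= 1 - \frac{\prod_{j=1}^{k}\left(1 - 2^{j-1-n+i}\right)}{\prod_{l=1}^{k}\left(1 - 2^{l-1-n}\right)}, && (11)
\end{aligned}
$$

where (9) is by the definition of $p_f(i,k)$, (10) is by noting that $\{\mathrm{rank}(\tilde{G}) = k\} \subseteq \{\mathrm{rank}(G) = k\}$, and (11) is by (8). ∎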
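The closed form in (7) can be compared against a direct simulation of the ensemble. The sketch below is illustrative: it represents each column of a $k \times n$ binary matrix as a $k$-bit integer, computes ranks over GF(2) by Gaussian elimination, and uses rejection sampling to keep only full-row-rank generator matrices. Function names and parameters are assumptions.

```python
import random


def full_rank_probability(m, k):
    """l(m, k) in (8): probability that k uniform binary vectors of length m are independent."""
    prob = 1.0
    for i in range(1, k + 1):
        prob *= 1.0 - 2.0 ** (i - 1 - m)
    return prob


def rank_loss_probability(i, n, k):
    """p_f(i, k) in (7) for an (n, k) code from the full-rank random ensemble."""
    return 1.0 - full_rank_probability(n - i, k) / full_rank_probability(n, k)


def gf2_rank(columns):
    """Rank over GF(2) of a matrix whose columns are given as k-bit integers."""
    basis = {}  # leading-bit position -> reduced vector
    for v in columns:
        while v:
            top = v.bit_length() - 1
            if top not in basis:
                basis[top] = v
                break
            v ^= basis[top]
    return len(basis)


def simulate_rank_loss(i, n, k, trials=20000, seed=1):
    """Monte-Carlo estimate of p_f(i, k): erase i random columns of a full-rank matrix."""
    rng = random.Random(seed)
    failures = accepted = 0
    while accepted < trials:
        cols = [rng.getrandbits(k) for _ in range(n)]
        if gf2_rank(cols) < k:
            continue  # rejection step: keep only full-row-rank generator matrices
        accepted += 1
        kept = rng.sample(cols, n - i)
        failures += gf2_rank(kept) < k
    return failures / trials
```

When fewer than $k$ columns survive, the formula returns one, matching the convention $p(i,k) = 1$ for $i > n-k$.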

###### Corollary 4.

The average execution time using random linear codes from the ensemble under maximum a posteriori (MAP) decoding is given by (5) while replacing $p(i,k)$ in (5) by $p_f(i,k)$, characterized in Lemma 3.

###### Proof.

The proof is by noting that the optimal MAP decoder fails to recover the $k$ input symbols given $n-i$ unerased encoded symbols if and only if the corresponding sub-matrix of the generator matrix of the code is not full row rank, which occurs with probability $p_f(i,k)$. ∎

Remark 3. Theorem 2 implies that the average execution time using linear codes consists of two terms. The first term is independent of the performance of the underlying coding scheme and is fixed given $n$, $k$, and $\mu$. However, the second term is determined by the error performance of the coding scheme, i.e., $p(i,k)$ for $i = 1, \dots, n-k$, and hence can be minimized by properly designing the coding scheme.

The following corollary of Theorem 2 demonstrates that MDS codes, if they exist, are optimal in the sense that they minimize the average execution time by eliminating the second term on the right-hand side of (5). (It is in general an open problem whether, given $n$, $k$, and $q$, there exists an $(n,k)$ MDS code over $\mathbb{F}_q$ [macwilliams1977theory, Ch. 11.2].) However, for a large number of servers $n$, the field size $q$ needs to be also large, e.g., for Reed-Solomon (RS) codes.

###### Corollary 5 (Optimality of MDS Codes).

For given $n$, $k$, and underlying field size $q$, an $(n,k)$ MDS code, if it exists, achieves the minimum average execution time that can be attained by any linear code.

###### Proof.

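MDS codes have the minimum distance of $d_{\min} = n-k+1$ and can recover up to $d_{\min} - 1 = n-k$ erasures, leading to $p(i,k) = 0$ for $i = 1, \dots, n-k$. Therefore, the second term of (5) becomes zero for MDS codes and they achieve the following minimum average execution time that can be attained by any linear code:

$$T^{\mathrm{MDS}}_{\mathrm{avg}} = \frac{1}{k} + \frac{1}{\mu k}\sum_{i=n-k+1}^{n}\frac{1}{i}. \tag{12}$$

∎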
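The value in (12) also has a direct operational reading: with an MDS code, the job finishes as soon as the $k$ fastest of the $n$ nodes have returned their outputs. The simulation below is a hedged sketch of that reading under the shifted-exponential model of Section II; names, seeds, and trial counts are illustrative choices.

```python
import random


def mds_average_time(n, k, mu):
    """Closed form (12) for the average execution time of an (n, k) MDS code."""
    return 1.0 / k + sum(1.0 / i for i in range(n - k + 1, n + 1)) / (mu * k)


def simulate_mds_completion(n, k, mu, trials=20000, seed=2):
    """Average time at which the k-th fastest of n nodes finishes its task."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each run time is the deterministic shift 1/k plus an Exp(mu * k) delay.
        times = sorted(1.0 / k + rng.expovariate(mu * k) for _ in range(n))
        total += times[k - 1]  # any k finished tasks form a decodable set
    return total / trials
```

The agreement between the two functions reflects the standard expression for the mean of the $k$-th order statistic of i.i.d. exponential variables.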

Using Theorem 2 and Remark 3, and given that the generator matrix of any linear code with minimum distance $d_{\min}$ remains full rank after removing up to any $d_{\min} - 1$ columns, we have the following proposition for the optimality criterion in terms of minimizing the average execution time.

###### Proposition 6 (Optimality Criterion).

An $(n,k)$ linear code that minimizes $\sum_{i=d_{\min}}^{n-k} \frac{1}{i}\, p(i,k)$ also minimizes the average execution time of a coded distributed computing system.

Although MDS codes meet the aforementioned optimality criterion over large field sizes, to the best of our knowledge, the optimal linear codes per Proposition 6, given the field size $q$ and in particular for $q = 2$, are not known and have not been studied before, which calls for future studies.

In the following theorem we characterize the gap between the execution time of binary random linear codes and the optimal execution time. Then Corollary 8 proves that binary random linear codes asymptotically achieve the normalized optimal execution time, thereby demonstrating the existence of good binary codes for distributed computation over real-valued data. The reason we compare the normalized $nT_{\mathrm{avg}}$'s instead of the $T_{\mathrm{avg}}$'s is that, using (5), $T_{\mathrm{avg}}$ has a factor of $1/k$ and hence vanishes as $n \to \infty$ for a fixed rate $R \triangleq k/n$. (More precisely, the coding rate over field size $q$ is equal to $\frac{k}{n}\log_2 q$, but with slight abuse of terminology we have dropped the factor of $\log_2 q$ since this factor is not relevant for coded distributed computing.)

###### Theorem 7 (Gap of Binary Random Linear Codes to the Optimal Performance).

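Let $T^{\mathrm{BRC}}_{\mathrm{avg}}$ denote the average execution time of a coded distributed computing system using binary random linear codes. Then, for any given $n$ and $k < n$, we have

$$\frac{1}{3\mu R(1-R)n} < \left|nT^{\mathrm{MDS}}_{\mathrm{avg}} - nT^{\mathrm{BRC}}_{\mathrm{avg}}\right| < \frac{1}{\mu R}\times\left[\frac{v(n)}{n-k-v(n)+1} + nR\, 2^{-v(n)}\ln\big(n-k-v(n)\big)\right], \tag{13}$$

where $R \triangleq k/n$ is the rate and $v(n)$ is an arbitrary integer-valued function of $n$ with $1 \leqslant v(n) < n-k$.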

###### Proof.

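Using Corollary 4 and Corollary 5, we have

$$S \triangleq \mu R\left|nT^{\mathrm{MDS}}_{\mathrm{avg}} - nT^{\mathrm{BRC}}_{\mathrm{avg}}\right| = \sum_{i=1}^{n-k}\frac{1}{i}\, p_f(i,k). \tag{14}$$

The lower bound in (13) is by noting that

$$S > p_f(n-k,k)/(n-k),$$

where $p_f(n-k,k)$ can be expressed as

$$p_f(n-k,k) = 1 - \frac{1-2^{-k}}{1-2^{-n}}\cdot\frac{1-2^{-k+1}}{1-2^{-n+1}}\cdot\ldots\cdot\frac{1-2^{-1}}{1-2^{k-n-1}}. \tag{15}$$

For $k < n$, every ratio in (15) is smaller than one and $2^{k-n-1} \leqslant 2^{-2}$, so we have

$$p_f(n-k,k) > 1 - \frac{1-2^{-1}}{1-2^{k-n-1}} > 1 - \frac{1-2^{-1}}{1-2^{-1-1}} = \frac{1}{3}. \tag{16}$$

Therefore, $S > \frac{1}{3(n-k)} = \frac{1}{3(1-R)n}$.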

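To prove the upper bound, the summation in (14) is split as $S = S_1 + S_2$, where

$$S_1 \triangleq \sum_{i=n-k-v(n)+1}^{n-k}\frac{1}{i}\, p_f(i,k) \tag{17}$$

and

$$S_2 \triangleq \sum_{i=1}^{n-k-v(n)}\frac{1}{i}\, p_f(i,k). \tag{18}$$

Since $p_f(i,k) < 1$ for $i \leqslant n-k$ and each of the $v(n)$ terms of $S_1$ has $1/i \leqslant 1/(n-k-v(n)+1)$, we have $S_1 < v(n)/(n-k-v(n)+1)$, which is the first term of the upper bound in (13). To upper-bound $S_2$, we first note that $p_f(i,k)$, defined in (7), is a monotonically increasing function of $i$. Then,

$$S_2 \leqslant p_f\big(n-k-v(n),k\big)\sum_{i=1}^{n-k-v(n)}\frac{1}{i}. \tag{19}$$

We can further upper-bound $p_f(n-k-v(n),k)$ as

$$
\begin{aligned}
p_f\big(n-k-v(n),k\big) &< 1 - \prod_{j=1}^{k}\left(1 - 2^{j-1-k-v(n)}\right) && (21)\\
&< 1 - \left[1 - 2^{-v(n)}\right]^{k} && (22)\\
&\leqslant nR\, 2^{-v(n)}, && (23)
\end{aligned}
$$

where (21) is by (7) together with $\prod_{l=1}^{k}\left(1 - 2^{l-1-n}\right) < 1$, (22) follows by noting that

$$\prod_{j=1}^{k}\left(1 - 2^{j-1-k-v(n)}\right) = \prod_{j'=1}^{k}\left(1 - 2^{-j'-v(n)}\right) > \left[1 - 2^{-v(n)}\right]^{k},$$

and (23) follows by Bernoulli's inequality $(1-x)^k \geqslant 1 - kx$ for any $0 \leqslant x \leqslant 1$ and then inserting $k = nR$. ∎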
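The two sides of (13) can be evaluated numerically for concrete parameters. The sketch below computes the normalized gap through (14), using the expression for $p_f(i,k)$ in (7), together with the lower and upper bounds of Theorem 7; the function names and the specific choice of $v(n)$ in any experiment are assumptions made for illustration.

```python
import math


def p_f(i, n, k):
    """p_f(i, k) from (7): rank-loss probability after erasing i of n columns."""
    num = den = 1.0
    for j in range(1, k + 1):
        num *= 1.0 - 2.0 ** (j - 1 - n + i)
        den *= 1.0 - 2.0 ** (j - 1 - n)
    return 1.0 - num / den


def normalized_gap(n, k, mu):
    """|n T_MDS - n T_BRC| = (1 / (mu R)) * sum_{i=1}^{n-k} p_f(i, k) / i, following (14)."""
    rate = k / n
    return sum(p_f(i, n, k) / i for i in range(1, n - k + 1)) / (mu * rate)


def theorem7_bounds(n, k, mu, v):
    """Lower and upper bounds of (13) for an integer v with 1 <= v < n - k."""
    rate = k / n
    lower = 1.0 / (3.0 * mu * rate * (1.0 - rate) * n)
    m = n - k - v
    upper = (v / (m + 1) + n * rate * 2.0 ** (-v) * math.log(m)) / (mu * rate)
    return lower, upper
```

For a fixed rate, evaluating `normalized_gap` at growing block-lengths illustrates the shrinking gap between binary random linear codes and MDS codes.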

###### Corollary 8 (Asymptotic Optimality of Binary Random Linear Codes).

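The normalized average execution time $nT^{\mathrm{BRC}}_{\mathrm{avg}}$ approaches $nT^{\mathrm{MDS}}_{\mathrm{avg}}$ as $n$ grows large. More precisely, for a given rate $R$, there exist constants $c_1$ and $c_2$ such that for sufficiently large $n$ we have

$$c_1\frac{1}{n} \leqslant \left|nT^{\mathrm{MDS}}_{\mathrm{avg}} - nT^{\mathrm{BRC}}_{\mathrm{avg}}\right| \leqslant c_2\frac{\log_2 n}{n}. \tag{24}$$

###### Proof.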

The lower bound holds with $c_1 = \frac{1}{3\mu R(1-R)}$ according to the left-hand side of (13). Observe that with the choice of $v(n) = 2\log_2 n$ both terms on the right-hand side of (13) become $O\!\left(\frac{\log_2 n}{n}\right)$. Note that $v(n) < n-k$ for sufficiently large $n$. Hence, the upper bound of (24) also holds with a proper choice of $c_2$. ∎

Remark 4. Using (12) and a similar approach to [lee2018speeding], one can show that the asymptotically-optimal encoding rate for an MDS-coded distributed computing system is the solution to

$$(1-R^*)\ln(1-R^*) = \mu(1-R^*) - R^*. \tag{25}$$

Corollary 8 implies that for distributed computation using binary random linear codes, the gap of $nT^{\mathrm{BRC}}_{\mathrm{avg}}$ to $nT^{\mathrm{MDS}}_{\mathrm{avg}}$ converges to zero as $n$ grows large. Accordingly, the optimal encoding rate also approaches $R^*$, described in (25).
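Equation (25) has no closed-form solution, but it is easy to solve numerically. The sketch below uses bisection, relying on the observation that the difference between the two sides of (25) is negative at $R = 0$, tends to one as $R \to 1$, and is increasing in $R$ for $\mu > 0$. The function names and tolerance are illustrative assumptions.

```python
import math


def rate_equation(rate, mu):
    """Left-hand side minus right-hand side of (25)."""
    return (1.0 - rate) * math.log(1.0 - rate) - mu * (1.0 - rate) + rate


def optimal_rate(mu, tol=1e-12):
    """Solve (25) for the asymptotically-optimal rate by bisection on (0, 1)."""
    lo, hi = 0.0, 1.0 - 1e-15
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rate_equation(mid, mu) < 0.0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A larger straggling parameter $\mu$ (less variable run times) pushes the solution toward higher rates, i.e., less redundancy.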

## IV Practical Codes and Simulation Results

In this section, simulation results for the expected-time performance of various coding schemes over distributed computing systems are presented. In particular, their gap to the optimal performance is shown, and their performance gains are compared with uncoded computation.

### IV-A Polar-Coded Distributed Computation

Binary polar codes are capacity-achieving linear codes with explicit constructions and low-complexity encoding and decoding [Arikan]. Also, the low-complexity encoding and decoding of polar codes can be adapted to work over real-valued data when dealing with erasures as in coded computation systems, as also noted in [bartan2019polar]. Next, we briefly explain the encoding and decoding procedure of real-valued data using binary polar codes and delineate how we can obtain the average execution times using Lemma 1.

#### IV-A1 Encoding Procedure

Arıkan's $n \times n$ polarization matrix $G_n = G_2^{\otimes m}$ is considered, where $n = 2^m$, $G_2 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, and $G_2^{\otimes m}$ denotes the $m$-th Kronecker power of $G_2$. Next, a design parameter, namely a design erasure probability, is picked, as specified later in Section IV-C. Then the polarization transform is applied to a binary erasure channel (BEC) with this erasure probability. The erasure probabilities of the polarized bit-channels are sorted and the rows of $G_n$ corresponding to the indices of the $k$ smallest ones are picked to construct the $k \times n$ generator matrix $G$. The encoding procedure using the resulting generator matrix $G$, which also applies to any binary linear code operating over real-valued data, is as follows. First, the computational job is divided into $k$ smaller tasks. Then the $j$-th encoded task, which will be sent to the $j$-th node, for $j = 1, \dots, n$, is the sum of all tasks $i$ for which the $(i,j)$-th element of $G$ is $1$.
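The encoding procedure can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it uses the standard BEC polarization recursion (an erasure probability $z$ splits into $2z - z^2$ and $z^2$), the natural (non-bit-reversed) row ordering of the Kronecker power, and scalar tasks in place of the matrix-valued partial results that a real job would produce. All function names are hypothetical.

```python
def bit_channel_erasure_probabilities(m, design_eps):
    """BEC polarization: each erasure probability z splits into (2z - z^2, z^2), m times."""
    z = [design_eps]
    for _ in range(m):
        z = [w for x in z for w in (2 * x - x * x, x * x)]
    return z


def kronecker_power(m):
    """m-th Kronecker power of the kernel [[1, 0], [1, 1]] as a list of 0/1 rows."""
    g = [[1]]
    for _ in range(m):
        g = [row + [0] * len(row) for row in g] + [row + row for row in g]
    return g


def polar_generator(m, k, design_eps):
    """Keep the k rows whose bit-channels have the smallest erasure probabilities."""
    z = bit_channel_erasure_probabilities(m, design_eps)
    g = kronecker_power(m)
    chosen = sorted(sorted(range(len(z)), key=lambda i: z[i])[:k])
    return [g[i] for i in chosen]


def encode_tasks(generator, tasks):
    """Encoded task j is the real-valued sum of the tasks i with generator[i][j] == 1."""
    n = len(generator[0])
    return [sum(t for row, t in zip(generator, tasks) if row[j]) for j in range(n)]
```

For example, `encode_tasks(polar_generator(2, 2, 0.5), [10.0, 1.0])` produces four encoded values, one per node, each being a plain sum of a subset of the two original tasks, so no finite-field arithmetic or quantization is involved.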

#### IV-A2 Decoding Procedure

The recursive structure of polar codes can be applied for low-complexity detection/decoding of real-valued data using parallel processing for more speedups [jamali2018low]. It is well-known that in the case of successive cancellation (SC) decoding over BECs, the probability of decoding failure of polar codes is , where denotes the set of indices of the selected rows.

Remark 5. Since polar SC decoder is sub-optimal in terms of successful decoding performance, one can think of optimal maximum-likelihood (ML) decoder to attain a lower failure probability at the cost of higher complexities. Consequently, investigating the possibility of attaining close-to-ML performance, e.g., using SC list decoding of polar codes [tal2015list], over real-valued data is an interesting problem deserving future studies when taking into account all time-consuming components of a coded distributed computing system.

#### IV-A3 Performance Characterization

Given the decoding method adopted, we can find the average execution time using Lemma 1. In particular, when SC decoding is adopted, $T_{\mathrm{avg}}$ can be obtained by numerically evaluating the integral of (4) involving $P_e(\epsilon, n)$. Moreover, for ML decoding, we first estimate the error probability using Monte-Carlo (MC) simulations and then apply (4).

### IV-B RM-Coded Distributed Computation

RM codes are closely related to polar codes: for an RM code, the generator matrix is constructed by choosing the rows of the polarization matrix $G_n$ (defined in Section IV-A1) having the largest Hamming weights. It was recently shown that RM codes are capacity-achieving over BECs [kudekar2017reed], though under bit-MAP decoding, and numerical results suggest that they actually achieve the capacity with almost optimal scaling [hassani2018almost]. There is still considerable interest in constructing low-complexity decoding algorithms for RM codes attaining such performance. In this paper, we apply MC-based simulation to estimate $P_e(\epsilon, n)$ for RM codes with the optimal ML decoder, and then evaluate their execution time, numerically, using (4). The inspiration behind considering RM codes in this paper is that they are believed to have almost optimal scaling which, we conjecture, is sufficient for asymptotic optimality, similar to random linear codes in Corollary 8, for coded distributed computing. The simulation results, provided next, support this conjecture.
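The RM construction and the Monte-Carlo estimate of the failure probability can be sketched as follows. This is an illustrative sketch with hypothetical names; it assumes, as in the proof of Corollary 4, that the optimal decoder fails exactly when the surviving columns of the generator matrix are not full row rank over GF(2). It also uses the fact that entry $(i,j)$ of the Kronecker power of the $2 \times 2$ kernel is one exactly when the set bits of $j$ are a subset of those of $i$, so that row $i$ has weight $2^{w}$ with $w$ the number of ones in the binary expansion of $i$.

```python
import random


def rm_generator_columns(m, r):
    """Columns (as k-bit ints) of an RM(r, m) generator: rows of weight >= 2^(m-r)."""
    n = 1 << m
    rows = [i for i in range(n) if bin(i).count("1") >= m - r]
    cols = []
    for j in range(n):
        col = 0
        for pos, i in enumerate(rows):
            if (i & j) == j:  # entry (i, j) of the Kronecker power equals 1
                col |= 1 << pos
        cols.append(col)
    return cols, len(rows)


def gf2_rank(vectors):
    """Rank over GF(2) of a list of bit-vectors stored as integers."""
    basis = {}
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in basis:
                basis[top] = v
                break
            v ^= basis[top]
    return len(basis)


def estimate_ml_failure(cols, k, eps, trials=5000, seed=3):
    """Monte-Carlo estimate of the failure probability at erasure probability eps."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        survivors = [c for c in cols if rng.random() >= eps]
        failures += gf2_rank(survivors) < k
    return failures / trials
```

Estimates obtained on a grid of erasure probabilities can then be fed into a numerical evaluation of (4) to approximate the average execution time.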

### IV-C Simulation Results

Numerical results for the performance of coded distributed computing systems utilizing MDS codes, binary random linear codes, polar codes, and RM codes are presented in Table I and are compared with the uncoded scenario over small block-lengths. We assume for all numerical results in this section. For MDS and random linear codes, $T_{\mathrm{avg}}$ is calculated using (12) and Corollary 4, respectively, and for polar and RM codes, it is numerically evaluated using (4) as discussed in Sections IV-A and IV-B. Then is obtained by minimizing $T_{\mathrm{avg}}$ over all possible values of $k$. We designed the polar code with , which is observed to be good enough for this range of block-lengths, but one can also attain slightly better performance for polar codes by optimizing the design parameter specifically for each $n$. Characterizing the best design parameter as a function of block-length is left for future work. In Table I, is defined as the percentage of the gain in compared to the uncoded scenario and is defined as the gap of for the underlying coding scheme to that of MDS codes, in percentage. Intuitively, for a coding scheme determines how much gain this scheme attains and indicates how close this scheme is to the optimal solution. Observe that polar codes with the low-complexity SC decoder achieve large enough ’s, close to the optimal values of , e.g., for versus for the MDS code. Closer performance to the optimal can be obtained by decoding polar codes with the ML decoder, e.g., for . Figure 1 shows that random linear codes have weak performance in the beginning but they quickly approach the optimal performance, so that they have small gaps to the optimal values, e.g., for . Also, observe that RM codes always outperform polar codes since, perhaps, they have a better distance distribution leading to better $p(i,k)$'s defined in Theorem 2.

In the case of , by numerically solving (25), we have for the asymptotically-optimal encoding rate $R^*$. Motivated by this fact, in Figure 2, the rate of all discussed underlying coding schemes is fixed to and is plotted for moderately large block-lengths, i.e., $T_{\mathrm{avg}}$ is not optimized over rates for the results demonstrated in this plot. Additionally, the polar code is designed with , which makes the code capacity-achieving for an erasure channel with capacity equal to . Note that there is still a gap between polar codes with the ML decoder and MDS codes. We believe this is due to the fact that binary polar codes with the $2 \times 2$ polarization kernel do not have an optimal scaling exponent [hassani2014finite]. Furthermore, Figure 2 suggests that RM codes attain the optimal performance, and also do so relatively fast, supporting our conjecture in Section IV-B.

## V Conclusion

In this paper, we presented a coding-theoretic approach toward coded distributed computing systems by connecting the problem of characterizing their average execution time to the traditional problem of finding the error probability of a coding scheme over erasure channels. Using this connection, we provided results on the performance of coded distributed computing systems, such as their best performance bounds and asymptotic results using binary random linear codes. We further analyzed the performance of polar and RM codes in the context of distributed computing systems. We conjecture that achieving the capacity of BECs with optimal scaling exponent is a sufficient condition for binary codes to be asymptotically optimal, in the sense defined in Theorem 7. We have shown this for binary random linear codes, which are well-known to have optimal scaling exponent, even with sparse generator matrices [mahdavifar2017scaling], and numerically verified this for RM codes by observing that they attain close to optimal performance using a moderate number of servers. It is also interesting to see whether having an optimal scaling exponent is also a necessary condition for codes to be asymptotically optimal, e.g., whether binary polar codes with the $2 \times 2$ polarization kernel are asymptotically optimal or not.

## References
