# A Fundamental Tradeoff between Computation and Communication in Distributed Computing

Songze Li, Mohammad Ali Maddah-Ali, Qian Yu, and A. Salman Avestimehr

S. Li, Q. Yu and A. S. Avestimehr are with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90089, USA (e-mail: songzeli@usc.edu; qyu880@usc.edu; avestimehr@ee.usc.edu). M. A. Maddah-Ali is with the Department of Electrical Engineering, Sharif University of Technology, Tehran, 11365, Iran (e-mail: maddah_ali@sharif.edu).

A preliminary part of this work was presented at the 53rd Annual Allerton Conference on Communication, Control, and Computing, 2015 [1]. A part of this work was presented at the IEEE International Symposium on Information Theory, 2016 [2]. A part of this work was presented at the 6th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, 2017 [3].

This work is in part supported by NSF grants CCF-1408639, NETS-1419632, ONR award N000141612189, NSA Award No. H98230-16-C-0255, and a research gift from Intel. This material is based upon work supported by Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0053. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

###### Abstract

How can we optimally trade extra computing power to reduce the communication load in distributed computing? We answer this question by characterizing a fundamental tradeoff between computation and communication in distributed computing, i.e., the two are inversely proportional to each other.

More specifically, a general distributed computing framework, motivated by commonly used structures like MapReduce, is considered, where the overall computation is decomposed into computing a set of “Map” and “Reduce” functions distributedly across multiple computing nodes. A coded scheme, named “Coded Distributed Computing” (CDC), is proposed to demonstrate that increasing the computation load of the Map functions by a factor of $r$ (i.e., evaluating each function at $r$ carefully chosen nodes) can create novel coding opportunities that reduce the communication load by the same factor.

An information-theoretic lower bound on the communication load is also provided, which matches the communication load achieved by the CDC scheme. As a result, the optimal computation-communication tradeoff in distributed computing is exactly characterized.

Finally, the coding techniques of CDC are applied to the Hadoop TeraSort benchmark to develop a novel CodedTeraSort algorithm, which is empirically demonstrated to speed up the overall job execution by $1.97\times$ to $3.39\times$ for typical settings of interest.

Index Terms: Distributed Computing, MapReduce, Computation-Communication Tradeoff, Coded Multicasting, Coded TeraSort

## I Introduction

We consider a general distributed computing framework, motivated by prevalent structures like MapReduce [4] and Spark [5], in which the overall computation is decomposed into two stages: “Map” and “Reduce”. Firstly in the Map stage, distributed computing nodes process parts of the input data locally, generating some intermediate values according to their designed Map functions. Next, they exchange the calculated intermediate values among each other (a.k.a. data shuffling), in order to calculate the final output results distributedly using their designed Reduce functions.

Within this framework, data shuffling often appears to limit the performance of distributed computing applications, including self-join [6], tera-sort [7], and machine learning algorithms [8]. For example, in a Facebook Hadoop cluster, it is observed that 33% of the overall job execution time is spent on data shuffling [8]. Also, as observed in [9], 70% of the overall job execution time is spent on data shuffling when running a self-join application on an Amazon EC2 cluster [10]. Motivated by these observations, we ask a fundamental question: can coding help distributed computing by reducing the communication load and speeding up the overall computation? Coding is known to be helpful in coping with channel uncertainty in telecommunication, and in reducing the storage cost in distributed storage systems and cache networks. In this work, we extend the application of coding to distributed computing and propose a framework to substantially reduce the load of data shuffling via coding and some extra computing in the Map phase.

More specifically, we formulate and characterize a fundamental tradeoff relationship between “computation load” in the Map phase and “communication load” in the data shuffling phase, and demonstrate that the two are inversely proportional to each other. We propose an optimal coded scheme, named “Coded Distributed Computing” (CDC), which demonstrates that increasing the computation load of the Map phase by a factor of $r$ (i.e., evaluating each Map function at $r$ carefully chosen nodes) can create novel coding opportunities in the data shuffling phase that reduce the communication load by the same factor.

To illustrate our main result, consider a distributed computing framework to compute $Q$ arbitrary output functions from $N$ input files, using $K$ distributed computing nodes. As mentioned earlier, the overall computation is performed by computing a set of Map and Reduce functions distributedly across the $K$ nodes. In the Map phase, each input file is processed locally, in one of the nodes, to generate $Q$ intermediate values, each corresponding to one of the $Q$ output functions. Thus, at the end of this phase, $QN$ intermediate values are calculated, which can be split into $Q$ subsets of $N$ intermediate values, and each subset is needed to calculate one of the output functions. In the Shuffle phase, for every output function to be calculated, all intermediate values corresponding to that function are transferred to one of the nodes for reduction. Of course, depending on the node that has been chosen to reduce an output function, a part of the intermediate values is already available locally, and does not need to be transferred in the Shuffle phase. This is because the Map phase has been carried out on the same set of nodes, and the results of mapping done at a node can remain in that node to be used for the Reduce phase. This offers some saving in the load of communication. To reduce the communication load even more, we may map each input file at more than one node. Clearly, this increases the fraction of intermediate values that are locally available. However, as we will show, there is a better way to exploit this redundancy in computation to reduce the communication load. The main message of this paper is that by following a particular pattern in repeating Map computations, along with some coding techniques, we can significantly reduce the load of communication. Perhaps surprisingly, we show that the gain of coding in reducing communication load scales with the size of the network.

To be more precise, we define the computation load $r$, $1 \leq r \leq K$, as the total number of computed Map functions at the nodes, normalized by $N$. For example, $r = 1$ means that none of the Map functions has been re-computed, and $r = 2$ means that on average each Map function is computed on two nodes. We also define the communication load $L$, $0 \leq L \leq 1$, as the total amount of information exchanged across nodes in the shuffling phase, normalized by the size of the $QN$ intermediate values, in order to compute the $Q$ output functions disjointly and uniformly across the $K$ nodes. Based on this formulation, we now ask the following fundamental question:

• Given a computation load $r$ in the Map phase, what is the minimum communication load $L^*(r)$, using any data shuffling scheme, needed to compute the final output functions?

We propose Coded Distributed Computing (CDC), which achieves a communication load of $L_{\text{coded}}(r) = \frac{1}{r} \cdot \left(1 - \frac{r}{K}\right)$ for $r = 1, \ldots, K$, and the lower convex envelope of these points. CDC employs a specific strategy to assign the computations of the Map and Reduce functions across the computing nodes, in order to enable novel coding opportunities for data shuffling. In particular, for a computation load $r \in \{1, \ldots, K\}$, CDC utilizes a carefully designed repetitive mapping of data blocks at $r$ distinct nodes to create coded multicast messages that deliver data simultaneously to a subset of $r \geq 1$ nodes. Hence, compared with an uncoded data shuffling scheme, which as we show later achieves a communication load $L_{\text{uncoded}}(r) = 1 - \frac{r}{K}$, CDC is able to reduce the communication load by exactly a factor of the computation load $r$. Furthermore, the proposed CDC scheme applies to a more general distributed computing framework where every output function is computed by more than one, or particularly $s \geq 1$, nodes, which provides better fault-tolerance in distributed computing.

We numerically compare the computation-communication tradeoffs of CDC and uncoded data shuffling schemes (i.e., $L_{\text{coded}}(r)$ and $L_{\text{uncoded}}(r)$) in Fig. 1. As illustrated, in the uncoded scheme that achieves a communication load $L_{\text{uncoded}}(r) = 1 - \frac{r}{K}$, increasing the computation load $r$ offers only a modest reduction in communication load. In fact, for any fixed $r$, this gain vanishes for a large number of nodes $K$. Consequently, it is not justified to trade computation for communication using uncoded schemes. However, for the coded scheme that achieves a communication load of $L_{\text{coded}}(r) = \frac{1}{r} \cdot \left(1 - \frac{r}{K}\right)$, increasing the computation load $r$ significantly reduces the communication load, and this gain does not vanish for large $K$. For example, as illustrated in Fig. 1, when mapping each file at one extra node ($r = 2$), CDC reduces the communication load by 55.6%, while the uncoded scheme only reduces it by 11.1%.
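The two percentages quoted above can be reproduced directly from the closed-form loads. The following is a minimal sketch, assuming $K = 10$ nodes (an assumption made here because the quoted 55.6% and 11.1% are consistent with it; the function names are illustrative and not taken from the paper).

```python
from fractions import Fraction

def uncoded_load(r: int, K: int) -> Fraction:
    """Communication load of uncoded shuffling: 1 - r/K."""
    return 1 - Fraction(r, K)

def coded_load(r: int, K: int) -> Fraction:
    """Communication load of CDC: (1/r) * (1 - r/K)."""
    return Fraction(1, r) * (1 - Fraction(r, K))

def relative_reduction(load, K: int) -> Fraction:
    """Fractional drop in load when going from r = 1 to r = 2."""
    return 1 - load(2, K) / load(1, K)

K = 10  # assumed number of nodes
for r in range(1, K + 1):
    print(r, float(uncoded_load(r, K)), float(coded_load(r, K)))

print(f"uncoded reduction: {float(relative_reduction(uncoded_load, K)):.1%}")
print(f"coded reduction:   {float(relative_reduction(coded_load, K)):.1%}")
```

Exact fractions are used so that the comparison does not depend on floating-point rounding.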

We also prove an information-theoretic lower bound on the minimum communication load $L^*(r)$. To prove the lower bound, we derive a lower bound on the total number of bits communicated by any subset of nodes, using induction on the size of the subset. To derive the lower bound for a particular subset of nodes, we first establish a lower bound on the number of bits needed by one of the nodes to recover the intermediate values it needs to calculate its assigned output functions, and then utilize the bound on the number of bits communicated by the rest of the nodes in that subset, which is given by the inductive argument. The derived lower bound on $L^*(r)$ matches the communication load achieved by the CDC scheme for any computation load $1 \leq r \leq K$. As a result, we exactly characterize the optimal tradeoff between computation load and communication load in the following:

$$L^*(r) = L_{\text{coded}}(r) = \frac{1}{r} \cdot \left(1 - \frac{r}{K}\right), \quad r \in \{1, \ldots, K\}.$$

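For general $1 \leq r \leq K$, $L^*(r)$ is the lower convex envelope of the above points $\left\{\left(r, \frac{1}{r} \cdot \left(1 - \frac{r}{K}\right)\right) : r \in \{1, \ldots, K\}\right\}$. Note that for large $K$, $\frac{1}{r} \cdot \left(1 - \frac{r}{K}\right) \to \frac{1}{r}$, hence $L^*(r) \approx \frac{1}{r}$. This result reveals a fundamental inversely proportional relationship between computation load and communication load in distributed computing. This also illustrates that the gain of $r$ achieved by CDC is optimal and cannot be improved by any other scheme (since $L_{\text{coded}}(r)$ is also an information-theoretic lower bound on the communication load that applies to any data shuffling scheme).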

Having theoretically characterized the optimal computation-communication tradeoff achieved by the proposed CDC scheme, we also empirically demonstrate the practical impact of this tradeoff. In particular, we apply the coding techniques of CDC to a widely used Hadoop sorting benchmark TeraSort [11], developing a novel coded distributed sorting algorithm CodedTeraSort [3]. We perform extensive experiments on Amazon EC2 clusters, and observe that for typical settings of interest, CodedTeraSort speeds up the overall execution of the conventional TeraSort by a factor of $1.97\times$ to $3.39\times$.

Finally, we discuss some future directions to extend the results of this work. In particular, we consider topics including heterogeneous networks with asymmetric tasks, straggling/failing computing nodes, multi-stage computation tasks, multi-layer networks and structured topology, joint storage and computation optimization, and coded edge/fog computing.

Related Works. The problem of characterizing the minimum communication for distributed computing has been previously considered in several settings in both the computer science and information theory communities. In [12], a basic computing model is proposed, where two parties hold inputs $x$ and $y$ respectively and aim to compute a Boolean function $f(x, y)$ by exchanging the minimum number of bits between them. Also, the problem of minimizing the required communication for computing the modulo-two sum of distributed binary sources with symmetric joint distribution was introduced in [13]. Following these two seminal works, a wide range of communication problems in the scope of distributed computing have been studied (see, e.g., [14, 15, 16, 17, 18, 19]). The key differences distinguishing the setting in this paper from most of the prior ones are: 1) we focus on the flow of communication in a general distributed computing framework, motivated by MapReduce, rather than the structures of the functions or the input distributions; 2) we do not impose any constraint on the numbers of output results, input data files and computing nodes (they can be arbitrarily large); 3) we do not assume any special property (e.g., linearity) of the computed functions.

The idea of efficiently creating and exploiting coded multicasting was initially proposed in the context of cache networks in [20, 21], and extended in [22, 23], where caches pre-fetch part of the content in a way to enable coding during the content delivery, minimizing the network traffic. In this paper, we propose a framework to study the tradeoff between computation and communication in distributed computing. We demonstrate that the coded multicasting opportunities exploited in the above caching problems also exist in the data shuffling of distributed computing frameworks, which can be created by a strategy of repeating the computations of the Map functions specified by the Coded Distributed Computing (CDC) scheme.

Finally, in a recent work [24], the authors have proposed methods for utilizing codes to speed up some specific distributed machine learning algorithms. The considered problem in this paper differs from [24] in the following aspects. We propose a general methodology for utilizing coding in data shuffling that can be applied to any distributed computing framework with a MapReduce structure, regardless of the underlying application. In other words, any distributed computing algorithm that fits in the MapReduce framework can benefit from the proposed CDC solution. We also characterize the information-theoretic computation-communication tradeoff in such frameworks. Furthermore, the coding used in [24] is at the application layer (i.e., applying computation on coded data), while in this paper we focus on applying codes directly on the shuffled data.

## II Problem Formulation

In this section, we formulate a general distributed computing framework motivated by MapReduce, and define the function characterizing the tradeoff between computation and communication.

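We consider the problem of computing $Q$ arbitrary output functions from $N$ input files using a cluster of $K$ distributed computing nodes (servers), for some positive integers $Q, N, K \in \mathbb{N}$, with $N \geq K$. More specifically, given $N$ input files $w_1, \ldots, w_N \in \mathbb{F}_{2^F}$, for some $F \in \mathbb{N}$, the goal is to compute $Q$ output functions $\phi_1, \ldots, \phi_Q$, where $\phi_q : (\mathbb{F}_{2^F})^N \to \mathbb{F}_{2^B}$, $q \in \{1, \ldots, Q\}$, maps all input files to a length-$B$ binary stream $u_q = \phi_q(w_1, \ldots, w_N) \in \mathbb{F}_{2^B}$, for some $B \in \mathbb{N}$.

Motivated by MapReduce, we assume that, as illustrated in Fig. 2, the computation of the output function $\phi_q$, $q \in \{1, \ldots, Q\}$, can be decomposed as follows: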

$$\phi_q(w_1, \ldots, w_N) = h_q(g_{q,1}(w_1), \ldots, g_{q,N}(w_N)), \tag{1}$$

where

• The “Map” functions $\vec{g}_n = (g_{1,n}, \ldots, g_{Q,n}) : \mathbb{F}_{2^F} \to (\mathbb{F}_{2^T})^Q$, $n \in \{1, \ldots, N\}$, map the input file $w_n$ into $Q$ length-$T$ intermediate values $v_{q,n} = g_{q,n}(w_n) \in \mathbb{F}_{2^T}$, $q \in \{1, \ldots, Q\}$, for some $T \in \mathbb{N}$ (see Footnote 1).

• The “Reduce” functions $h_q : (\mathbb{F}_{2^T})^N \to \mathbb{F}_{2^B}$, $q \in \{1, \ldots, Q\}$, map the intermediate values of the output function $\phi_q$ in all input files into the output value $u_q = h_q(v_{q,1}, \ldots, v_{q,N})$.

Footnote 1: When mapping a file, we compute $Q$ intermediate values in parallel, one for each of the $Q$ output functions. The main reason to do this is that parallel processing can be efficiently performed for applications that fit into the MapReduce framework. In other words, mapping a file according to all functions is only marginally more expensive than mapping it according to one function. For example, for the canonical Word Count job, while we are scanning a document to count the number of appearances of one word, we can simultaneously count the numbers of appearances of other words with marginally increased computation cost.

###### Remark 1.

Note that for every set of output functions such a Map-Reduce decomposition exists (e.g., setting the $g_{q,n}$ to identity functions such that $g_{q,n}(w_n) = w_n$ for all $n$, and $h_q$ to $\phi_q$ in (1)). However, such a decomposition is not unique, and in the distributed computing literature, there has been considerable work on developing appropriate decompositions of computations like join, sorting and matrix multiplication (see, e.g., [4, 25]), so that they can be performed efficiently in a distributed manner. Here we do not impose any constraint on how the Map and Reduce functions are chosen (for example, they can be arbitrary linear or non-linear functions).

The above computation is carried out by $K$ distributed computing nodes, labelled as Node $1$, $\ldots$, Node $K$. They are interconnected through a multicast network. Following the above decomposition, the computation proceeds in three phases: Map, Shuffle and Reduce.

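Map Phase: Node $k$, $k \in \{1, \ldots, K\}$, computes the Map functions of a set of files $\mathcal{M}_k$, which are stored on Node $k$, for some design parameter $\mathcal{M}_k \subseteq \{w_1, \ldots, w_N\}$. For each file $w_n$ in $\mathcal{M}_k$, Node $k$ computes $\vec{g}_n(w_n) = (v_{1,n}, \ldots, v_{Q,n})$. We assume that each file is mapped by at least one node, i.e., $\cup_{k=1}^{K} \mathcal{M}_k = \{w_1, \ldots, w_N\}$.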

###### Definition 1 (Computation Load).

We define the computation load, denoted by $r$, $1 \leq r \leq K$, as the total number of Map functions computed across the $K$ nodes, normalized by the number of files $N$, i.e., $r \triangleq \frac{\sum_{k=1}^{K} |\mathcal{M}_k|}{N}$. The computation load $r$ can be interpreted as the average number of nodes that map each file.

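Shuffle Phase: Node $k$, $k \in \{1, \ldots, K\}$, is responsible for computing a subset of output functions, whose indices are denoted by a set $\mathcal{W}_k \subseteq \{1, \ldots, Q\}$. We focus on the case $\frac{Q}{K} \in \mathbb{N}$, and utilize a symmetric task assignment across the $K$ nodes to maintain load balance. More precisely, we require 1) $|\mathcal{W}_1| = \cdots = |\mathcal{W}_K| = \frac{Q}{K}$, and 2) $\mathcal{W}_j \cap \mathcal{W}_k = \emptyset$ for all $j \neq k$.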

###### Remark 2.

Beyond the symmetric task assignment considered in this paper, characterizing the optimal computation-communication tradeoff allowing general asymmetric task assignments is a challenging open problem. As a first step toward studying this problem, in our follow-up work [26], in which the number of output functions is fixed and the computing resources are abundant (e.g., a large number of computing nodes), we have shown that asymmetric task assignments can do better than the symmetric ones, and achieve the optimum run-time performance.

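To compute the output value $u_q$ for some $q \in \mathcal{W}_k$, Node $k$ needs the intermediate values that are not computed locally in the Map phase, i.e., $\{v_{q,n} : q \in \mathcal{W}_k, w_n \notin \mathcal{M}_k\}$. After Node $k$, $k \in \{1, \ldots, K\}$, has finished mapping all the files in $\mathcal{M}_k$, the $K$ nodes proceed to exchange the needed intermediate values. In particular, each node $k$, $k \in \{1, \ldots, K\}$, creates an input symbol $X_k \in \mathbb{F}_{2^{\ell_k}}$, for some $\ell_k \in \mathbb{N}$, as a function of the intermediate values computed locally during the Map phase, i.e., for some encoding function $\psi_k : (\mathbb{F}_{2^T})^{Q|\mathcal{M}_k|} \to \mathbb{F}_{2^{\ell_k}}$ at Node $k$, we have

$$X_k = \psi_k\left(\{\vec{g}_n : w_n \in \mathcal{M}_k\}\right). \tag{2}$$

Having generated the message $X_k$, Node $k$ multicasts it to all other nodes.

By the end of the Shuffle phase, each of the $K$ nodes receives $X_1, \ldots, X_K$ free of error.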

###### Definition 2 (Communication Load).

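We define the communication load, denoted by $L$, $0 \leq L \leq 1$, as $L \triangleq \frac{\ell_1 + \cdots + \ell_K}{QNT}$. That is, $L$ represents the (normalized) total number of bits communicated by the $K$ nodes during the Shuffle phase (see Footnote 2).

Footnote 2: For notational convenience, we define all variables in binary extension fields. However, one can consider arbitrary field sizes. For example, we can consider all intermediate values $v_{q,n}$, $q \in \{1, \ldots, Q\}$, $n \in \{1, \ldots, N\}$, to be in the field $\mathbb{F}_{p^T}$, for some prime number $p$ and positive integer $T$, and the symbol communicated by Node $k$ (i.e., $X_k$) to be in the field $\mathbb{F}_{s^{\ell_k}}$ for some prime number $s$ and positive integer $\ell_k$, for all $k \in \{1, \ldots, K\}$. In this case, the communication load can be defined as $L \triangleq \frac{(\ell_1 + \cdots + \ell_K) \log s}{QNT \log p}$.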

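Reduce Phase: Node $k$, $k \in \{1, \ldots, K\}$, uses the messages $X_1, \ldots, X_K$ communicated in the Shuffle phase, and the local results from the Map phase $\{\vec{g}_n : w_n \in \mathcal{M}_k\}$, to construct inputs to the corresponding Reduce functions of $\mathcal{W}_k$, i.e., for each $q \in \mathcal{W}_k$ and some decoding function $\chi_k^q : \mathbb{F}_{2^{\ell_1}} \times \cdots \times \mathbb{F}_{2^{\ell_K}} \times (\mathbb{F}_{2^T})^{Q|\mathcal{M}_k|} \to (\mathbb{F}_{2^T})^N$, Node $k$ computes

$$(v_{q,1}, \ldots, v_{q,N}) = \chi_k^q\left(X_1, \ldots, X_K, \{\vec{g}_n : w_n \in \mathcal{M}_k\}\right). \tag{3}$$

Finally, Node $k$, $k \in \{1, \ldots, K\}$, computes the Reduce function $u_q = h_q(v_{q,1}, \ldots, v_{q,N})$ for all $q \in \mathcal{W}_k$.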

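We say that a computation-communication pair $(r, L) \in \mathbb{R}^2$ is feasible if for any $\epsilon > 0$ and sufficiently large $N$, there exist $\mathcal{M}_1, \ldots, \mathcal{M}_K$, $\mathcal{W}_1, \ldots, \mathcal{W}_K$, a set of encoding functions $\{\psi_k\}_{k=1}^{K}$, and a set of decoding functions $\{\chi_k^q : q \in \mathcal{W}_k\}_{k=1}^{K}$ that achieve a computation-communication pair $(\tilde{r}, \tilde{L}) \in \mathbb{Q}^2$ such that $|r - \tilde{r}| \leq \epsilon$, $|L - \tilde{L}| \leq \epsilon$, and Node $k$ can successfully compute all the output functions whose indices are in $\mathcal{W}_k$, for all $k \in \{1, \ldots, K\}$.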

###### Definition 3.

We define the computation-communication function of the distributed computing framework as

$$L^*(r) \triangleq \inf\{L : (r, L) \text{ is feasible}\}. \tag{4}$$

$L^*(r)$ characterizes the optimal tradeoff between computation and communication in this framework.

Example (Uncoded Scheme). In the Shuffle phase of a simple “uncoded” scheme, each node receives the needed intermediate values sent uncodedly by some other nodes. Since a total of $QN$ intermediate values are needed across the $K$ nodes and $rN \cdot \frac{Q}{K}$ of them are already available after the Map phase, the communication load achieved by the uncoded scheme is

$$L_{\text{uncoded}}(r) = 1 - r/K. \tag{5}$$

###### Remark 3.

After the Map phase, each node knows the intermediate values of all $Q$ output functions in the files it has mapped. Therefore, for a fixed file assignment $(\mathcal{M}_1, \ldots, \mathcal{M}_K)$ and any symmetric assignment of the Reduce functions, specified by $(\mathcal{W}_1, \ldots, \mathcal{W}_K)$, we can satisfy the data requirements using the same data shuffling scheme up to relabelling the Reduce functions. In other words, the communication load is independent of the assignment of the Reduce functions.
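The counting argument behind (5), and the independence from the Reduce assignment noted in Remark 3, can be checked on a small instance. The sketch below is illustrative only: the file placement (three nodes, six files, each file mapped twice) and all names are assumptions chosen for the example, not taken from the paper.

```python
from fractions import Fraction
from itertools import permutations

def uncoded_load(files_at, reduces_at, N, Q):
    """Count the intermediate values (q, n) each node still needs after the
    Map phase, normalized by the Q*N intermediate values in total."""
    needed = set()
    for k, files in enumerate(files_at):
        for q in reduces_at[k]:
            for n in range(N):
                if n not in files:  # file n was not mapped at node k
                    needed.add((k, q, n))
    return Fraction(len(needed), Q * N)

K, N, Q = 3, 6, 3
# Assumed placement: every file is mapped at exactly two nodes.
files_at = [{0, 1, 2, 3}, {2, 3, 4, 5}, {4, 5, 0, 1}]
r = Fraction(sum(len(f) for f in files_at), N)  # computation load

# Try every one-to-one assignment of the Q = K Reduce functions to nodes.
loads = {
    uncoded_load(files_at, [{perm[k]} for k in range(K)], N, Q)
    for perm in permutations(range(Q))
}
print(r, loads, 1 - r / K)
```

Every relabelling of the Reduce functions yields the same load, which coincides with $1 - r/K$ for this placement.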

In this paper, we also consider a generalization of the above framework, which we call the “cascaded distributed computing framework”, where after the Map phase, each Reduce function is computed by more than one, or particularly $s$, nodes, for some $s \in \{1, \ldots, K\}$. This generalized model is motivated by the fact that many distributed computing jobs require multiple rounds of Map and Reduce computations, where the Reduce results of the previous round serve as the inputs to the Map functions of the next round. Computing each Reduce function at more than one node admits data redundancy for the subsequent Map-function computations, which can help to improve the fault-tolerance and reduce the communication load of the next-round data shuffling. We focus on the case $\frac{Q}{\binom{K}{s}} \in \mathbb{N}$, and enforce a symmetric assignment of the Reduce tasks to maintain load balance. Particularly, we require that every subset of $s$ nodes compute a disjoint subset of $\frac{Q}{\binom{K}{s}}$ Reduce functions.

The feasible computation-communication triple $(r, s, L)$ is defined similarly to the feasible pair above. We define the computation-communication function of the cascaded distributed computing framework as

$$L^*(r, s) \triangleq \inf\{L : (r, s, L) \text{ is feasible}\}. \tag{6}$$

## III Main Results

###### Theorem 1.

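The computation-communication function of the distributed computing framework, $L^*(r)$, is given by

$$L^*(r) = L_{\text{coded}}(r) \triangleq \frac{1}{r} \cdot \left(1 - \frac{r}{K}\right), \quad r \in \{1, \ldots, K\}, \tag{7}$$

for sufficiently large $T$. For general $1 \leq r \leq K$, $L^*(r)$ is the lower convex envelope of the above points $\left\{\left(r, \frac{1}{r} \cdot \left(1 - \frac{r}{K}\right)\right) : r \in \{1, \ldots, K\}\right\}$.

We prove the achievability of Theorem 1 by proposing a coded scheme, named Coded Distributed Computing, in Section V. We demonstrate that no other scheme can achieve a communication load smaller than the lower convex envelope of the points $\left\{\left(r, \frac{1}{r} \cdot \left(1 - \frac{r}{K}\right)\right) : r \in \{1, \ldots, K\}\right\}$ by proving the converse in Section VI.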

###### Remark 4.

Theorem 1 exactly characterizes the optimal tradeoff between the computation load and the communication load in the considered distributed computing framework.

###### Remark 5.

For $r \in \{1, \ldots, K\}$, the communication load achieved in Theorem 1 is less than that of the uncoded scheme in (5) by a multiplicative factor of $r$, which equals the computation load and can grow unboundedly as the number of nodes $K$ increases if, e.g., $r = \Theta(K)$. As illustrated in Fig. 1 in Section I, while the communication load of the uncoded scheme decreases linearly as the computation load increases, $L_{\text{coded}}(r)$ achieved in Theorem 1 is inversely proportional to the computation load.

###### Remark 6.

While increasing the computation load $r$ causes a longer Map phase, the coded achievable scheme of Theorem 1 maximizes the reduction of the communication load using the extra computations. Therefore, Theorem 1 provides an analytical framework for optimally trading computation power in the Map phase for more bandwidth in the Shuffle phase, which helps to minimize the overall execution time of applications whose performance is limited by data shuffling.

###### Theorem 2.

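The computation-communication function of the cascaded distributed computing framework, $L^*(r, s)$, for $r \in \{1, \ldots, K\}$, is characterized by

$$L^*(r, s) = L_{\text{coded}}(r, s) \triangleq \sum_{\ell = \max\{r+1, s\}}^{\min\{r+s, K\}} \frac{\ell \binom{K}{\ell} \binom{\ell-2}{r-1} \binom{r}{\ell-s}}{r \binom{K}{r} \binom{K}{s}}, \tag{8}$$

for some $s \in \{1, \ldots, K\}$ and sufficiently large $T$. For general $1 \leq r \leq K$, $L^*(r, s)$ is the lower convex envelope of the above points $\{(r, L_{\text{coded}}(r, s)) : r \in \{1, \ldots, K\}\}$.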

We present the Coded Distributed Computing scheme that achieves the computation-communication function in Theorem 2 in Section V, and the converse of Theorem 2 in Section VII.

###### Remark 7.

A preliminary part of this result, in particular the achievability for the special case of $s = 1$, i.e., the achievable scheme of Theorem 1, was presented in [1]. We note that when $s = 1$, Theorem 2 provides the same result as Theorem 1, i.e., $L^*(r, 1) = \frac{1}{r} \cdot \left(1 - \frac{r}{K}\right)$, for $r \in \{1, \ldots, K\}$.
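The reduction of (8) to (7) at $s = 1$ can also be verified numerically. The following sketch evaluates the sum in (8) with exact rational arithmetic; the function name `cascaded_coded_load` is illustrative, and the range of $K$ tested is arbitrary.

```python
from fractions import Fraction
from math import comb

def cascaded_coded_load(r: int, s: int, K: int) -> Fraction:
    """Evaluate the right-hand side of (8) for integer r, s in {1, ..., K}."""
    total = Fraction(0)
    for l in range(max(r + 1, s), min(r + s, K) + 1):
        numerator = l * comb(K, l) * comb(l - 2, r - 1) * comb(r, l - s)
        denominator = r * comb(K, r) * comb(K, s)
        total += Fraction(numerator, denominator)
    return total

# With s = 1 only the term l = r + 1 survives, and (8) collapses to (7).
for K in range(2, 9):
    for r in range(1, K + 1):
        expected = Fraction(1, r) * (1 - Fraction(r, K))
        assert cascaded_coded_load(r, 1, K) == expected

# A cascaded instance: K = 4 nodes, r = 2, s = 2.
print(cascaded_coded_load(2, 2, 4))
```

For $r = K$ the summation range is empty, so the load is zero, matching the fact that no shuffling is needed when every node maps every file.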

###### Remark 8.

For any fixed $s \in \{1, \ldots, K\}$ (the number of nodes that compute each Reduce function), as illustrated in Fig. 3, the communication load achieved in Theorem 2 outperforms the linear relationship between computation and communication, i.e., it is superlinear with respect to the computation load $r$.

Before we proceed to describe the general achievability scheme for the cascaded distributed computing framework (which includes the distributed computing framework as the special case $s = 1$), we first illustrate the key ideas of the proposed Coded Distributed Computing scheme by presenting two examples in the next section, for the cases of $s = 1$ and $s > 1$ respectively.

## IV Illustrative Examples: Coded Distributed Computing

In this section, we present two illustrative examples of the proposed achievable scheme for Theorem 1 and Theorem 2, which we call Coded Distributed Computing (CDC), for the cases of $s = 1$ (Theorem 1) and $s > 1$ (Theorem 2) respectively.

###### Example 1 (CDC for s=1).

We consider a MapReduce-type problem in Fig. 4 for distributed computing of $Q = 3$ output functions, represented by red/circle, green/square, and blue/triangle respectively, from $N = 6$ input files, using $K = 3$ computing nodes. Nodes $1$, $2$, and $3$ are respectively responsible for the final reduction of the red/circle, green/square, and blue/triangle output functions. Let us first consider the case where no redundancy is imposed on the computations, i.e., each file is mapped once and the computation load is $r = 1$. As shown in Fig. 4(a), Node $k$ maps File $2k-1$ and File $2k$ for $k = 1, 2, 3$. In this case, each node maps $2$ input files locally, computing all three intermediate values needed for the three output functions from each mapped file. In Fig. 4, we represent, for example, the intermediate value of the red/circle function in File $n$ using a red circle labelled by $n$, for all $n \in \{1, \ldots, 6\}$. Similar representations follow for the green/square and the blue/triangle functions. After the Map phase, each node obtains $2$ out of the $6$ intermediate values required to reduce the output function it is responsible for (e.g., Node 1 knows the red circles in File 1 and File 2). Hence, each node needs $4$ intermediate values from the other nodes, yielding a communication load of $\frac{4 \times 3}{3 \times 6} = \frac{2}{3}$.

Now, we demonstrate how the proposed CDC scheme trades computation load for a much smaller communication load via in-network coding. As shown in Fig. 4(b), we double the computation load such that each file is now mapped on two nodes ($r = 2$). Since more local computations are performed, each node now only requires $2$ other intermediate values, and an uncoded shuffling scheme would achieve a communication load of $\frac{2 \times 3}{3 \times 6} = \frac{1}{3}$. However, we can do much better with coding. As shown in Fig. 4(b), instead of unicasting individual intermediate values, every node multicasts a bit-wise XOR, denoted by $\oplus$, of $2$ locally computed intermediate values to the other two nodes, simultaneously satisfying their data demands. For example, knowing the blue/triangle in File $3$, Node $2$ can cancel it from the coded packet sent by Node $1$, recovering the needed green/square in File $1$. Therefore, this coding incurs a communication load of $\frac{3}{3 \times 6} = \frac{1}{6}$, achieving a $2\times$ gain over the uncoded shuffling.
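The XOR-based shuffle of this example can be simulated end to end. The sketch below is a toy reconstruction under stated assumptions: the file placement (Node 1 maps Files 1 to 4, Node 2 maps Files 3 to 6, Node 3 maps Files 5, 6, 1, 2), the particular pairing of values in each packet, and the 8-byte random stand-ins for intermediate values are all chosen here for illustration and are consistent with, but not copied from, Fig. 4(b).

```python
import os
from fractions import Fraction

T = 8  # bytes per intermediate value (arbitrary)
N, Q, K = 6, 3, 3

# v[(q, n)]: intermediate value of function q in file n (random stand-ins).
v = {(q, n): os.urandom(T) for q in range(1, Q + 1) for n in range(1, N + 1)}

# Assumed placement with r = 2; Node k reduces function k.
mapped = {1: {1, 2, 3, 4}, 2: {3, 4, 5, 6}, 3: {5, 6, 1, 2}}

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Each node multicasts one XOR of two values (q, n) that it computed locally.
packets = {1: ((2, 1), (3, 3)), 2: ((1, 5), (3, 4)), 3: ((1, 6), (2, 2))}
coded = {k: xor(v[a], v[b]) for k, (a, b) in packets.items()}

recovered = {k: {} for k in mapped}
for sender, (a, b) in packets.items():
    for receiver in mapped:
        if receiver == sender:
            continue
        # The receiver cancels the value it already knows to get the other.
        for known, wanted in ((a, b), (b, a)):
            if known[1] in mapped[receiver] and wanted[0] == receiver:
                recovered[receiver][wanted] = xor(coded[sender], v[known])

# Every node now holds its function's value for all N files.
complete = all(
    {n for (_, n) in recovered[k]} | mapped[k] == set(range(1, N + 1))
    for k in mapped
)
correct = all(val == v[key] for k in recovered for key, val in recovered[k].items())
coded_load = Fraction(len(packets), Q * N)
print(complete, correct, coded_load)
```

Three multicast packets replace the six unicast values of the uncoded scheme, which is where the factor-of-two saving comes from.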

From the above example, we see that for the case of $s = 1$, i.e., each of the $Q$ output functions is computed on one node and the computations of the Reduce functions are symmetrically distributed across nodes, the proposed CDC scheme only requires performing bit-wise XOR as the encoding and decoding operations. However, for the case of $s > 1$, as we will show in the following example, the proposed CDC scheme requires computing linear combinations of the intermediate values during the encoding process.

###### Example 2 (CDC for s>1).

In this example, we consider a job of computing $Q = 6$ output functions from $N = 6$ input files, using $K = 4$ nodes. We focus on the case where the computation load $r = 2$, and each Reduce function is computed by $s = 2$ nodes. In the Map phase, each file is mapped by $r = 2$ nodes. As shown in Fig. 5, the sets of the files mapped by the $4$ nodes are $\mathcal{M}_1 = \{w_1, w_2, w_3\}$, $\mathcal{M}_2 = \{w_1, w_4, w_5\}$, $\mathcal{M}_3 = \{w_2, w_4, w_6\}$, and $\mathcal{M}_4 = \{w_3, w_5, w_6\}$. After the Map phase, Node $k$, $k \in \{1, 2, 3, 4\}$, knows the intermediate values of all $Q$ output functions in the files in $\mathcal{M}_k$, i.e., $\{v_{q,n} : q \in \{1, \ldots, 6\}, w_n \in \mathcal{M}_k\}$. In the Reduce phase, we assign the computations of the Reduce functions in a symmetric manner such that every subset of $s = 2$ nodes computes a common Reduce function. More specifically, as shown in Fig. 5, the sets of indices of the Reduce functions computed by the $4$ nodes are $\mathcal{W}_1 = \{1, 2, 3\}$, $\mathcal{W}_2 = \{1, 4, 5\}$, $\mathcal{W}_3 = \{2, 4, 6\}$, and $\mathcal{W}_4 = \{3, 5, 6\}$. Therefore, for example, Node 1 still needs the intermediate values $\{v_{q,n} : q \in \{1, 2, 3\}, n \in \{4, 5, 6\}\}$ through data shuffling to compute its assigned Reduce functions $h_1$, $h_2$, $h_3$.

The data shuffling process consists of two rounds of communication over the multicast network. In the first round, intermediate values are communicated within each subset of $3$ nodes. In the second round, intermediate values are communicated within the set of all $4$ nodes. In what follows, we describe these two rounds of communication respectively.

Round 1: Subsets of $3$ nodes. We first consider the subset $\{1, 2, 3\}$. During the data shuffling, each node whose index is in $\{1, 2, 3\}$ multicasts a bit-wise XOR of two locally computed intermediate values to the other two nodes:

• Node 1 multicasts $v_{1,2} \oplus v_{2,1}$ to Node $2$ and Node $3$,

• Node 2 multicasts $v_{4,1} \oplus v_{1,4}$ to Node $1$ and Node $3$,

• Node 3 multicasts $v_{4,2} \oplus v_{2,4}$ to Node $1$ and Node $2$.

Since Node 2 knows $v_{2,1}$ and Node 3 knows $v_{1,2}$ locally, they can respectively decode $v_{1,2}$ and $v_{2,1}$ from the coded message $v_{1,2} \oplus v_{2,1}$.

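We employ a similar coded shuffling scheme on the other 3 subsets of 3 nodes. After the first round of shuffling,

• Node 1 recovers $(v_{1,4}, v_{2,4})$, $(v_{1,5}, v_{3,5})$ and $(v_{2,6}, v_{3,6})$,

• Node 2 recovers $(v_{1,2}, v_{4,2})$, $(v_{1,3}, v_{5,3})$ and $(v_{4,6}, v_{5,6})$,

• Node 3 recovers $(v_{2,1}, v_{4,1})$, $(v_{2,3}, v_{6,3})$ and $(v_{4,5}, v_{6,5})$,

• Node 4 recovers $(v_{3,1}, v_{5,1})$, $(v_{3,2}, v_{6,2})$ and $(v_{5,4}, v_{6,4})$.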

Round 2: All $4$ nodes. We first split each of the intermediate values $v_{6,1}$, $v_{5,2}$, $v_{4,3}$, $v_{3,4}$, $v_{2,5}$, and $v_{1,6}$ into two equal-sized segments each containing $T/2$ bits, which are denoted by $v_{q,n}^{(1)}$ and $v_{q,n}^{(2)}$ for an intermediate value $v_{q,n}$. Then, for some coefficients $\alpha_1, \alpha_2, \alpha_3 \in \mathbb{F}_{2^{T/2}}$, Node $1$ multicasts the following two linear combinations of three locally computed segments to the other three nodes.

$$v_{4,3}^{(1)} + v_{5,2}^{(1)} + v_{6,1}^{(1)}, \tag{9}$$

$$\alpha_1 v_{4,3}^{(1)} + \alpha_2 v_{5,2}^{(1)} + \alpha_3 v_{6,1}^{(1)}. \tag{10}$$

Similarly, as shown in Fig. 5, each of Node $2$, Node $3$, and Node $4$ multicasts two linear combinations of three locally computed segments to the other three nodes, using the same coefficients $\alpha_1$, $\alpha_2$, and $\alpha_3$.

Having received the above two linear combinations, each of Node $2$, Node $3$, and Node $4$ first subtracts out one segment available locally from the combinations, or more specifically, $v_{6,1}^{(1)}$ for Node $2$, $v_{5,2}^{(1)}$ for Node $3$, and $v_{4,3}^{(1)}$ for Node $4$. After the subtraction, each of these three nodes recovers the required segments from the two linear combinations. More specifically, Node 2 recovers $v_{4,3}^{(1)}$ and $v_{5,2}^{(1)}$, Node 3 recovers $v_{4,3}^{(1)}$ and $v_{6,1}^{(1)}$, and Node 4 recovers $v_{5,2}^{(1)}$ and $v_{6,1}^{(1)}$. It is not difficult to see that the above decoding process is guaranteed to be successful if $\alpha_1$, $\alpha_2$, and $\alpha_3$ are all distinct from each other, which requires the field size $2^{T/2} \geq 3$ (e.g., $T = 4$). Following a similar procedure, each node recovers the required segments from the linear combinations multicast by the other three nodes. More specifically, after the second round of data shuffling,

• Node 1 recovers $v_{1,6}$, $v_{2,5}$ and $v_{3,4}$,

• Node 2 recovers $v_{1,6}$, $v_{4,3}$ and $v_{5,2}$,

• Node 3 recovers $v_{2,5}$, $v_{4,3}$ and $v_{6,1}$,

• Node 4 recovers $v_{3,4}$, $v_{5,2}$ and $v_{6,1}$.

We finally note that in the second round of data shuffling, each linear combination multicast by a node is simultaneously useful for all of the other three nodes.
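The subtract-then-solve decoding of the second round can be sketched as follows. This is an illustration under stated assumptions rather than the scheme's exact arithmetic: each segment is modeled as a single symbol of a prime field of size `P = 257`, and the coefficients in `ALPHA` are hypothetical distinct values. Real segments would be vectors of field symbols processed element-wise.

```python
P = 257            # assumed prime field size (an assumption, not from the text)
ALPHA = (1, 2, 3)  # hypothetical distinct coefficients alpha_1, alpha_2, alpha_3

def encode(segs):
    """Return the two linear combinations of three segments, as in (9) and (10)."""
    y1 = sum(segs) % P
    y2 = sum(a * s for a, s in zip(ALPHA, segs)) % P
    return y1, y2

def decode(y1, y2, known_idx, known_val):
    """Recover the two unknown segments given the one segment known locally."""
    i, j = [t for t in range(3) if t != known_idx]
    # Subtract the locally known segment from both combinations.
    z1 = (y1 - known_val) % P
    z2 = (y2 - ALPHA[known_idx] * known_val) % P
    # Solve the remaining 2x2 system; it is invertible because ALPHA[i] != ALPHA[j].
    inv = pow((ALPHA[i] - ALPHA[j]) % P, -1, P)
    seg_i = (z2 - ALPHA[j] * z1) * inv % P
    seg_j = (z1 - seg_i) % P
    return {i: seg_i, j: seg_j}

segments = (10, 200, 77)  # hypothetical segment values held by the sender
y1, y2 = encode(segments)
recovered = [decode(y1, y2, k, segments[k]) for k in range(3)]
```

Each of the three receivers knows a different segment, yet all three decode from the same pair of multicast symbols, provided the coefficients are pairwise distinct. The modular inverse via `pow(x, -1, P)` requires Python 3.8 or later.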

## V General Achievable Scheme: Coded Distributed Computing

In this section, we formally prove the upper bounds in Theorems 1 and 2 by presenting and analyzing the Coded Distributed Computing (CDC) scheme. We focus on the more general case considered in Theorem 2 with , and the scheme for Theorem 1 simply follows by setting .

We first consider the integer-valued computation load , and then generalize the CDC scheme for any . When , every node can map all the input files and compute all the output functions locally, thus no communication is needed and for all . In what follows, we focus on the case where .

We consider a sufficiently large number of input files , and , for some . We first inject empty files into the system to obtain a total of files, which is now a multiple of . We note that . Next, we proceed to present the achievable scheme for a system with input files .

### V-a Map Phase Design

In the Map phase, the input files are evenly partitioned into disjoint batches of size , each corresponding to a subset of size , i.e.,

$$\{w_1,\ldots,w_{\bar{N}}\} = \bigcup_{\mathcal{T}\subset\{1,\ldots,K\},\,|\mathcal{T}|=r} \mathcal{B}_{\mathcal{T}}, \tag{11}$$

where denotes the batch of files corresponding to the subset .

Given this partition, Node , , computes the Map functions of the files in if , or equivalently, if . Since each node is in subsets of size , each node computes Map functions, i.e., for all . After the Map phase, Node , , knows the intermediate values of all output functions in the files in , i.e., .
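The placement rule in (11) can be sketched by enumerating all size-$r$ subsets of the nodes and handing each subset its own batch of files. This is a minimal sketch with hypothetical parameters ($K=4$, $r=2$, one file per batch); the function name `map_assignment` and the integer file indices are illustrative choices, not notation from the paper.

```python
from itertools import combinations

def map_assignment(K: int, r: int, batch_size: int = 1):
    """Assign one batch of files to every size-r subset of nodes 1..K."""
    nodes = range(1, K + 1)
    subsets = list(combinations(nodes, r))
    # Batch B_T holds `batch_size` consecutive file indices for subset T.
    batches = {
        T: list(range(i * batch_size, (i + 1) * batch_size))
        for i, T in enumerate(subsets)
    }
    # Node k maps every file whose batch subset T contains k.
    mapped = {
        k: [n for T, files in batches.items() if k in T for n in files]
        for k in nodes
    }
    return batches, mapped

batches, mapped = map_assignment(K=4, r=2)
num_files = sum(len(files) for files in batches.values())
```

With these parameters there are $\binom{4}{2}=6$ batches, every file is mapped at exactly $r=2$ nodes, and every node maps $\binom{3}{1}=3$ files, which equals $rN/K$ for $N=6$.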

### V-B Coded Data Shuffling

We recall that we focus on the case where the number of the output functions satisfies , and enforce a symmetric assignment of the Reduce functions such that every subset of nodes reduces functions. Specifically, for some , and the computations of the Reduce functions are assigned symmetrically across the nodes as follows. First, the Reduce functions are evenly partitioned into disjoint batches of size , each corresponding to a unique subset of nodes, i.e.,

$$\{1,\ldots,Q\} = \bigcup_{\mathcal{P}\subseteq\{1,\ldots,K\},\,|\mathcal{P}|=s} \mathcal{D}_{\mathcal{P}}, \tag{12}$$

where denotes the indices of the batch of Reduce functions corresponding to the subset .

Given this partition, Node , , computes the Reduce functions whose indices are in if , or equivalently, if . As a result, each node computes Reduce functions, i.e., for all .

For a subset of and with , we denote the set of intermediate values needed by all nodes in , no node outside , and known exclusively by nodes in as . More formally:

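$$\mathcal{V}^{\mathcal{S}\setminus\mathcal{S}_1}_{\mathcal{S}_1} \triangleq \Big\{ v_{q,n} :\ q \in \bigcap_{k\in\mathcal{S}\setminus\mathcal{S}_1} \mathcal{W}_k,\ q \notin \bigcup_{k\notin\mathcal{S}} \mathcal{W}_k,\ w_n \in \bigcap_{k\in\mathcal{S}_1} \mathcal{M}_k,\ w_n \notin \bigcup_{k\notin\mathcal{S}_1} \mathcal{M}_k \Big\}. \tag{13}$$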

We observe that the set defined above contains intermediate values of output functions. This is because the output functions whose intermediate values are included in should be computed exclusively by the nodes in and a subset of nodes in . Therefore, contains the intermediate values of a total of output functions. Since every subset of nodes maps a unique batch of files, contains intermediate values.
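The counting argument above can be checked by brute force directly from the four membership conditions in (13). The sketch below assumes one Reduce function per batch $\mathcal{D}_{\mathcal{P}}$, one file per batch $\mathcal{B}_{\mathcal{T}}$, and $|\mathcal{S}_1| = r$; the function name `exclusive_values` and the parameters $K=4$, $r=s=2$ are hypothetical. Under these assumptions, the number of qualifying Reduce batches works out to $\binom{r}{|\mathcal{S}|-s}$, which the example compares against the enumeration.

```python
from itertools import combinations
from math import comb

def exclusive_values(K, r, s, S, S1):
    """Enumerate (P, T) batch pairs whose intermediate values satisfy (13)."""
    S, S1 = frozenset(S), frozenset(S1)
    nodes = range(1, K + 1)
    reduce_batches = [frozenset(P) for P in combinations(nodes, s)]
    file_batches = [frozenset(T) for T in combinations(nodes, r)]
    # q is reduced by every node in S \ S1 and by no node outside S.
    good_P = [P for P in reduce_batches if (S - S1) <= P and P <= S]
    # w_n is mapped by every node in S1 and by no node outside S1.
    good_T = [T for T in file_batches if S1 <= T and T <= S1]
    return [(P, T) for P in good_P for T in good_T]

K, r, s = 4, 2, 2
count_size3 = len(exclusive_values(K, r, s, S={1, 2, 3}, S1={1, 2}))
count_size4 = len(exclusive_values(K, r, s, S={1, 2, 3, 4}, S1={1, 2}))
```

For $|\mathcal{S}|=3$ the enumeration finds the Reduce batches $\{1,3\}$ and $\{2,3\}$, and for $|\mathcal{S}|=4$ only $\{3,4\}$, in both cases paired with the single file batch indexed by $\mathcal{S}_1$.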

Next, we first concatenate all intermediate values in to construct a symbol . Then for , we arbitrarily and evenly split into segments, each containing bits, i.e.,

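$$U^{\mathcal{S}\setminus\mathcal{S}_1}_{\mathcal{S}_1} = \left( U^{\mathcal{S}\setminus\mathcal{S}_1}_{\mathcal{S}_1,\sigma_1},\ U^{\mathcal{S}\setminus\mathcal{S}_1}_{\mathcal{S}_1,\sigma_2},\ \ldots,\ U^{\mathcal{S}\setminus\mathcal{S}_1}_{\mathcal{S}_1,\sigma_r} \right), \tag{14}$$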

where denotes the segment associated with Node .

For each , there are a total of subsets of with size that contain the element . We index these subsets as . Within a subset , the segment associated with Node is , for all . We note that each segment , , is known by all nodes whose indices are in , and needed by all nodes whose indices are in .

#### V-B1 Encoding

The shuffling scheme of CDC consists of multiple rounds, each corresponding to all subsets of the nodes with a particular size. Within each subset, each node multicasts linear combinations of the segments that are associated with it to the other nodes in the subset. More specifically, for each subset of size , we define and . Then for each , Node computes message symbols, denoted by as follows. For some coefficients where for all , Node  computes

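$$
\begin{aligned}
X^{\mathcal{S}}_{k}[1] &= U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[1]}_{\mathcal{S}_{(k)}[1],k} + U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[2]}_{\mathcal{S}_{(k)}[2],k} + \cdots + U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[n_1]}_{\mathcal{S}_{(k)}[n_1],k}, \\
X^{\mathcal{S}}_{k}[2] &= \alpha_1 U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[1]}_{\mathcal{S}_{(k)}[1],k} + \alpha_2 U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[2]}_{\mathcal{S}_{(k)}[2],k} + \cdots + \alpha_{n_1} U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[n_1]}_{\mathcal{S}_{(k)}[n_1],k}, \\
&\ \ \vdots \\
X^{\mathcal{S}}_{k}[n_2] &= \alpha_1^{n_2-1} U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[1]}_{\mathcal{S}_{(k)}[1],k} + \alpha_2^{n_2-1} U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[2]}_{\mathcal{S}_{(k)}[2],k} + \cdots + \alpha_{n_1}^{n_2-1} U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[n_1]}_{\mathcal{S}_{(k)}[n_1],k},
\end{aligned}
\tag{15}
$$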

or equivalently,

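$$
\begin{bmatrix}
X^{\mathcal{S}}_{k}[1] \\ X^{\mathcal{S}}_{k}[2] \\ \vdots \\ X^{\mathcal{S}}_{k}[n_2]
\end{bmatrix}
=
\begin{bmatrix}
1 & 1 & \cdots & 1 \\
\alpha_1 & \alpha_2 & \cdots & \alpha_{n_1} \\
\vdots & \vdots & \ddots & \vdots \\
\alpha_1^{n_2-1} & \alpha_2^{n_2-1} & \cdots & \alpha_{n_1}^{n_2-1}
\end{bmatrix}
\begin{bmatrix}
U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[1]}_{\mathcal{S}_{(k)}[1],k} \\
U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[2]}_{\mathcal{S}_{(k)}[2],k} \\
\vdots \\
U^{\mathcal{S}\setminus\mathcal{S}_{(k)}[n_1]}_{\mathcal{S}_{(k)}[n_1],k}
\end{bmatrix}.
\tag{16}
$$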

We note that the above encoding process is the same at all nodes whose indices are in , i.e., each of them multiplies the same matrix in (16) with the segments associated with it.

Having generated the above message symbols, Node  multicasts them to the other nodes whose indices are in .
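The encoding step amounts to multiplying a Vandermonde-type matrix by the vector of segments associated with the sending node. The following is a minimal sketch, assuming a prime field of size `P = 257` and scalar segments; the helper names `vandermonde` and `encode_messages`, the coefficients, and the segment values are all hypothetical.

```python
P = 257  # assumed prime field size (an assumption, not from the text)

def vandermonde(alphas, n_rows):
    """Rows are the element-wise powers alpha^0, alpha^1, ..., alpha^(n_rows-1)."""
    return [[pow(a, i, P) for a in alphas] for i in range(n_rows)]

def encode_messages(segments, alphas, n_rows):
    """Compute n_rows message symbols as linear combinations of the segments."""
    matrix = vandermonde(alphas, n_rows)
    return [sum(c * u for c, u in zip(row, segments)) % P for row in matrix]

segments = [5, 7, 11]  # hypothetical n_1 = 3 segments associated with one node
alphas = (1, 2, 3)     # hypothetical distinct coefficients
messages = encode_messages(segments, alphas, n_rows=2)  # n_2 = 2 message symbols
```

The first message is the plain sum of the segments and the second weights each segment by its coefficient, matching the first two rows of the encoding equations. Because the matrix depends only on the coefficients, every node in the subset can reuse it unchanged.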

###### Remark 9.

When , i.e., every output function is computed by one node, the above shuffling scheme only takes one round for all subsets of size . Instead of multicasting linear combinations, every node in can simply multicast the bit-wise XOR of its associated segments to the other nodes in .

#### V-B2 Decoding

For and , there are a total of subsets of that have size and simultaneously contain and . Hence, among all segments associated with Node , of them are already known at Node , and the rest of segments are needed by Node . We denote the indices of the subsets that contain the element but not the element as , such that , and for all .

After receiving the symbols from Node , Node  first removes the locally known segments from the linear combinations to generate symbols , such that

 ⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣YSjk[1]YSjk[2]⋮YSjk[n2]⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣11⋯1αb1jkαb2jk⋯αb