“Short-Dot”: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products
Abstract
Faced with the saturation of Moore’s law and the increasing size and dimension of data, system designers have increasingly resorted to parallel and distributed computing to reduce the computation time of machine-learning algorithms. However, distributed computing is often bottlenecked by a small fraction of slow processors called “stragglers” that reduce the speed of computation because the fusion node has to wait for all processors to complete their processing. To combat the effect of stragglers, recent literature proposes introducing redundancy in computations across processors, e.g., using repetition-based strategies or erasure codes. The fusion node can exploit this redundancy by completing the computation using outputs from only a subset of the processors, ignoring the stragglers. In this paper, we propose a novel technique, called “Short-Dot”, that introduces redundant computations in a coding-theory inspired fashion for computing linear transforms of long vectors. Instead of computing long dot products as required in the original linear transform, we construct a larger number of redundant and short dot products that can be computed faster and more efficiently at individual processors. In comparison with existing schemes that introduce redundancy to tackle stragglers, Short-Dot reduces the cost of computation, storage, and communication, since shorter portions are stored and computed at each processor, and shorter portions of the input are communicated to each processor. Further, only a subset of these short dot products is required at the fusion node to finish the computation successfully, enabling us to ignore stragglers. We demonstrate through probabilistic analysis as well as experiments on computing clusters that Short-Dot offers significant speedups compared to existing techniques.
We also derive trade-offs between the length of the dot products and the resilience to stragglers (the number of processors to wait for) that hold for any such strategy, and compare them to the trade-off achieved by our strategy.
Sanghamitra Dutta (Carnegie Mellon University, sanghamd@andrew.cmu.edu), Viveck Cadambe (Pennsylvania State University, viveck@engr.psu.edu), Pulkit Grover (Carnegie Mellon University, pgrover@andrew.cmu.edu)
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
1 Introduction
This work proposes a coding-theory inspired computation technique for speeding up the computation of linear transforms of high-dimensional data by distributing it across multiple processing units that compute shorter dot products. Our main focus is on addressing the “straggler effect,” i.e., the problem of delays caused by a few slow processors that bottleneck the entire computation. To address this problem, we provide techniques (building on [kananspeeding, gauristraggler, gauriefficient, gauri2014delay, huang2012codes]) that introduce redundancy in the computation by designing a novel error-correction mechanism that allows the individual dot products computed at each processor to be shorter than the length of the input. Shorter dot products offer advantages in computation, storage, and communication in distributed linear transforms.
The computation of linear transforms of high-dimensional vectors is the critical step [dally2015] in several machine learning and signal processing applications. Dimensionality reduction techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and random projections require the computation of short-and-fat linear transforms on high-dimensional data. Linear transforms are the building blocks of solutions to various machine learning problems, e.g., regression and classification, and are also used in acquiring and preprocessing data through Fourier transforms, wavelet transforms, filtering, etc. Fast and reliable computation of linear transforms is thus a necessity for low-latency inference [dally2015]. Due to the saturation of Moore’s law, increasing the speed of computing on a single processor is becoming difficult, forcing practitioners to adopt parallel processing to speed up computing for ever-increasing data dimensions and sizes.
Classical approaches to computing linear transforms across parallel processors, e.g., Block-Striped Decomposition [kumar1994introduction], Fox’s method [fox1987matrix, kumar1994introduction], and Cannon’s method [kumar1994introduction], rely on dividing the computational task equally among all available processors without any redundant computation. (Strassen’s algorithm [strassen1969gaussian] and its generalizations offer a recursive approach to faster matrix multiplications over multiple processors, but they are often not preferred because of their high communication cost [ballard2014communication].) The fusion node collects the outputs from each processor to complete the computation and thus has to wait for all the processors to finish. In almost all distributed systems, a few slow or faulty processors, called “stragglers” [straggler_tail], are observed to delay the entire computation. This unpredictable latency in distributed systems is attributed to factors such as network latency, shared resources, maintenance activities, and power limitations. In order to combat stragglers, cloud computing frameworks like Hadoop [hadoop] employ various straggler-detection techniques and usually reset the tasks allotted to stragglers. Forward error-correction techniques offer an alternative approach to deal with this “straggler effect” by introducing redundancy in the computational tasks across different processors. The fusion node then requires outputs from only a subset of all the processors to finish successfully. In this context, the use of erasure codes dates back to the ideas of algorithm-based fault tolerance [ABFT1984, faultbook]. Recently, optimized repetition and Maximum Distance Separable (MDS) [ryan2009channel] codes have been explored [gauristraggler, gauriefficient, kananspeeding, mohammad2016] to speed up computations.
We consider the problem of computing A x, where A is a given M × N matrix and x is an N × 1 vector that is input to the computation. In contrast with [kananspeeding], which also uses codes to compute linear transforms in parallel, we allow the individual dot products computed at each processor to be smaller than N, the length of the input.
Why might one be interested in computing short dot products while performing an overall large linear transform?
One reason is straightforward: the computation time depends on the length of the dot products computed. Processors are also inherently memory-limited, which limits the size of the dot products that can be computed. In some distributed and cloud computing systems, the computation time is dominated by the time taken to communicate the input to the processors. In systems where multicasting is not possible or is inefficient, it may be faster to communicate only a subset of the coordinates of x to each processor. In such systems, we anticipate that communicating shorter vectors, each formed by these subsets of the coordinates of x, is likely to result in substantial speedups over schemes that require the entire vector (in particular when multicasting is difficult). (Another interesting example comes from recent work on designing processing units that exclusively compute dot products using analog components [analog_dot, ericpop]. These devices are prone to errors and increased delays in convergence when designed for larger dot products.) In Sections 4 and 6, we show both theoretically (under model assumptions inspired by [kananspeeding] that admit a simplified expected-time analysis while being a crude approximation of experimental observations) and experimentally that the speedup using Short-Dot can be increased beyond that obtained using the strategy proposed in [kananspeeding] in straggler-prone environments.
To summarize, our main contributions are:

- To compute A x for a given M × N matrix A, we instead compute B x, where we construct a P × N matrix B (with the total number of processors P greater than the required number of dot products M) such that each length-N row of B has at most s nonzero elements. Because the locations of zeros in a row of B are known by design, this reduces the complexity of computing dot products of rows of B with x. Here K parameterizes the resilience to stragglers: any K of the P dot products of rows of B with x are sufficient to recover A x, i.e., any K rows of B can be linearly combined to generate the M rows of A.

- We provide fundamental limits on the trade-off between the length s of the dot products and the straggler resilience K (the number of processors to wait for) for any such strategy in Section 3. This gives a lower bound on the length of the task allotted per processor. Our limits show that Short-Dot is near-optimal.

- Assuming exponential tails of service times at each server (as used in [kananspeeding]), we derive the expected computation time required by our strategy and compare it to uncoded parallel processing, the repetition strategy, and MDS-coding-based [ryan2009channel] linear computation (see Fig. 2). We also explicitly exhibit a regime where Short-Dot outperforms all its competing strategies in expected computation time by a factor that diverges to infinity for large P. In general, Short-Dot is found to be universally faster than its competing strategies over the entire range of M. When M is linear in P, Short-Dot offers a speedup by a factor of Θ(log P) over uncoded parallel processing and repetition. When M is sublinear in P, Short-Dot outperforms repetition- or MDS-coding-based linear computation by a factor that diverges as P/M grows.

- We also provide experimental results showing that Short-Dot is faster than existing strategies.
In concurrent work [gradientcoding], Tandon et al. consider a coded computation problem similar to ours for the special case where M, the number of length-N dot products to be computed, is 1, and the given matrix (in this case just a single row vector) is the all-ones vector. We note that for M = 1, the gain of using coded strategies over replication-based strategies is bounded even as the number of processors grows. Our paper differs from [gradientcoding] in that we consider the more general case M ≥ 1, and observe that the gains over replication can be unbounded when M scales with P. For M > 1, the number of operations per processor using our strategy is lower than an application of [gradientcoding] for the same worst-case straggler resilience. To see this, note that a straightforward extension of the strategy proposed in [gradientcoding] that encodes each of the M rows of A separately would require M dot products of length N(P − K + 1)/P at each processor, while, using a “joint” encoding across rows, Short-Dot only requires a single dot product of length N(P − K + M)/P (note that N(P − K + M)/P < M N(P − K + 1)/P for M > 1) at each processor, while still requiring the same number of processors (any K out of P) to finish. Further, we also provide a tighter converse for M > 1 that proves that Short-Dot is near-optimal. It is worth noting that [gradientcoding] also introduces the notion of partial stragglers, which is outside the scope of our paper.
For the rest of the paper, we define the sparsity of a vector v as the number of nonzero elements in the vector, i.e., ‖v‖₀. We also assume N is large compared to P, so that it is reasonable to assume that P divides N.
1.1 Comparison with existing strategies:
Consider the problem of computing a single dot product of an input vector x with a pre-specified vector w, both of length N. By an “uncoded” parallel processing strategy (which includes Block-Striped Decomposition [kumar1994introduction]), we mean a strategy that does not use redundancy to overcome delays caused by stragglers. One uncoded strategy is to partition the dot product into P smaller dot products, where P is the number of available processors. E.g., w can be divided into P parts, constructing P short vectors of sparsity N/P, with each vector stored in a different processor (as shown in Fig. 1, left). Only the nonzero values of each vector need to be stored, since the locations of the nonzero values are known a priori at every node. One might expect the computation time for each processor to reduce by a factor of P. However, the fusion node now has to wait for all P processors to finish their computation, and the stragglers can delay the entire computation. Can we construct vectors such that dot products of only a subset of them with x are sufficient to compute w^T x? A simple coded strategy is repetition with block partitioning, i.e., partitioning the vector w of length N into N/s parts of sparsity s each, and repeating each part sP/N times so as to obtain P vectors of sparsity s, as shown in Fig. 1 (right). For each of the N/s parts of the vector, the fusion node only needs the output of one processor among all its repetitions. Instead of a single dot product, if one requires the dot products of x with M vectors w_1, …, w_M, one can simply repeat the aforementioned strategy M times.
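As an illustration of the uncoded strategy just described, the following sketch (our own Python/NumPy code; all variable names are hypothetical) partitions a single dot product across P workers and shows why the fusion node must wait for every one of them:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 12, 4
w = rng.standard_normal(N)  # pre-specified vector
x = rng.standard_normal(N)  # input vector

# Uncoded strategy: partition w (and the matching coordinates of x)
# into P chunks of sparsity N/P, one chunk per processor.
w_parts = w.reshape(P, N // P)
x_parts = x.reshape(P, N // P)
partials = [w_parts[i] @ x_parts[i] for i in range(P)]

# The fusion node must add ALL P partial sums: a single straggler
# delays the entire dot product.
assert np.isclose(sum(partials), w @ x)
```

Each worker stores only N/P values of w, but there is no redundancy: losing any one partial sum makes w^T x unrecoverable.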
For multiple dot products, an alternative repetition-based strategy is to compute each of the M dot products P/M times in parallel at different processors. Now we only have to wait for at least one processor corresponding to each of the M vectors to finish (see Fig. 2). Improving upon repetition, it is shown in [kananspeeding] that an MDS code allows constructing P coded vectors such that any M of the P dot products can be used to reconstruct all M of the original dot products (see Fig. 2). This strategy is shown, both experimentally and theoretically, to perform better than the repetition and uncoded strategies.
Can we go beyond MDS codes? MDS-codes-based strategies require length-N dot products to be computed at each processor. Short-Dot instead constructs P vectors of sparsity s (less than N), such that the dot product of x with any K out of these P short vectors is sufficient to compute the dot products of x with all the M given vectors (see Fig. 2). Compared to MDS codes, Short-Dot is more flexible: it waits for some more processors (since K ≥ M), but each processor computes a shorter dot product. Short-Dot also effectively reduces the communication cost, since only a shorter portion of the input vector is communicated to each processor. We also propose Short-MDS, an extension of the MDS-codes-based strategy in [kananspeeding] that creates short dot products of length s through block partitioning, and compare it with Short-Dot. In regimes where s exactly divides N, Short-MDS may be viewed as a special case of Short-Dot. But when s does not exactly divide N, Short-MDS has to wait for more processors in the worst case than Short-Dot for the same sparsity s, as discussed in Remark 2 in Section 2.
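The flexibility described above can be summarized by the sparsity-resilience trade-off that Short-Dot achieves in Section 2, s = N(P − K + M)/P. The following sketch (our own helper, assuming that expression) shows how the two extremes recover MDS-like and uncoded-like operating points:

```python
def shortdot_sparsity(N, P, K, M):
    """Dot-product length achieved by Short-Dot when the fusion node
    waits for any K of P processors (assumed trade-off from Section 2)."""
    return N * (P - K + M) / P

# Waiting for fewer processors (smaller K) forces longer dot products.
N, P, M = 1200, 12, 3
assert shortdot_sparsity(N, P, K=M, M=M) == N           # K = M: MDS-like, full length
assert shortdot_sparsity(N, P, K=P, M=M) == N * M / P   # K = P: uncoded-like load
assert shortdot_sparsity(N, P, K=8, M=M) < shortdot_sparsity(N, P, K=6, M=M)
```

Sliding K between M and P interpolates between the two existing strategies, which is exactly the design space Short-Dot exploits.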
2 Our coded parallelization strategy: Short-Dot
In this section, we provide our strategy for computing the linear transform A x, where x is the N × 1 input vector and A is a given M × N matrix. Short-Dot constructs a P × N matrix B such that predetermined linear combinations of any K rows of B are sufficient to generate each of the M rows of A, and any row of B has sparsity at most s = N(P − K + M)/P. Each sparse row of B (say b_i) is sent to the i-th processor (i = 1, …, P), and the dot products of x with all P sparse rows are computed in parallel. Let supp(b_i) denote the support (the set of nonzero indices) of b_i. Thus, for any unknown vector x, short dot products of length |supp(b_i)| ≤ s are computed at each processor. Since a linear combination of any K rows of B can generate the rows of A, the dot products from the earliest K out of the P processors can be linearly combined to obtain the linear transform A x. Before formally stating our algorithm, we first provide insight into why such a matrix B exists in the following theorem, and develop intuition for the construction strategy.
Theorem 1
Given M row vectors a_1, …, a_M of length N, there exists a P × N matrix B such that a linear combination of any K rows of the matrix B is sufficient to generate the M row vectors, and each row of B has sparsity at most s = N(P − K + M)/P, provided P divides N.
Proof: We may append K − M rows z_1, …, z_{K−M} to A to form a K × N matrix F. The precise choice of these additional vectors will be made explicit later. Next, we choose H, a P × K matrix such that any K × K square submatrix of H is invertible (this condition is relaxed in Remark 1), and set B = H F. The following lemma shows that any K rows of the matrix B are sufficient to generate any row of F, including the rows of A:
Lemma 1
Let B = H F, where F is a K × N matrix and H is any P × K matrix such that every K × K square submatrix of H is invertible. Then, any K rows of B can be linearly combined to generate any row of F.
Proof: Choose an arbitrary index set S ⊆ {1, …, P} such that |S| = K. Let B_S be the K × N submatrix formed by the K chosen rows of B indexed by S. Then, B_S = H_S F. Now, H_S is a K × K submatrix of H, and is thus invertible. Thus, F = H_S^{-1} B_S. The i-th row of F is (the i-th row of H_S^{-1}) multiplied by B_S, for i = 1, …, K. Thus, each row of F is generated by the K chosen rows of B.
In the next lemma, we show how the row sparsity of B can be constrained to be at most N(P − K + M)/P by appropriately choosing the appended vectors z_1, …, z_{K−M}.
Lemma 2
Given an M × N matrix A, let F be a K × N matrix formed by appending K − M row vectors to A. Also let H be a P × K matrix such that every square submatrix of H is invertible. Then there exists a choice of the appended vectors such that each row of B = H F has sparsity at most N(P − K + M)/P.
Proof: We select a sparsity pattern that we want to enforce on B, and then show that there exists a choice of the appended vectors such that the pattern can be enforced.
Sparsity pattern enforced on B: This is illustrated in Fig. 4. First, we construct a “unit block” of size P × N/P with a cyclic structure of nonzero entries, where the K − M zeros in each column are arranged in cyclic shifts, as shown in Fig. 4, so that the zeros are spread evenly across the rows. This unit block is replicated horizontally P times to form a P × N matrix with K − M zeros in each column, and at most N(K − M)/P zeros, i.e., at most N(P − K + M)/P nonzero entries, in each row. We now show how the choice of z_1, …, z_{K−M} can enforce this pattern on B.
From B = H F, the j-th column of B can be written as b_j = H f_j, where f_j is the j-th column of F. Each column of B has at least K − M zeros, at the locations indexed by the set τ_j. Let H_{τ_j} denote the (K − M) × K submatrix of H consisting of the rows of H indexed by τ_j. Thus, H_{τ_j} f_j = 0.
Divide f_j into two portions of lengths M and K − M as follows: f_j = [a_j ; z_j].
Here a_j is the j-th column of the given matrix A, and z_j depends on the choice of the appended vectors. Thus,
(1) H_{τ_j} f_j = H1_{τ_j} a_j + H2_{τ_j} z_j = 0
(2) z_j = −(H2_{τ_j})^{-1} H1_{τ_j} a_j
where H1_{τ_j} and H2_{τ_j} denote the first M and the last K − M columns of H_{τ_j}, respectively, and the last step uses the fact that H2_{τ_j} is invertible because it is a (K − M) × (K − M) square submatrix of H. This explicitly provides the vector z_j that completes the j-th column of F. The other columns of F can be completed similarly, proving the lemma.
From Lemmas 1 and 2, for a given matrix A, there always exists a matrix B such that a linear combination of any K rows of B is sufficient to generate the M given row vectors, and each row of B has sparsity at most N(P − K + M)/P. This proves the theorem.
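Theorem 1’s construction is concrete enough to prototype directly. The following sketch is our own NumPy code, not the authors’ implementation: it uses a cyclic zero pattern with K − M forced zeros per column and a random Gaussian H (whose square submatrices are invertible with probability 1, as noted in Remark 1), then verifies both the sparsity guarantee and recovery from an arbitrary set of K processors:

```python
import numpy as np

def shortdot_encode(A, P, K, H):
    """Build B (P x N) so that any K rows of B generate the rows of A and
    each row of B has at most N(P-K+M)/P nonzeros (cyclic zero pattern)."""
    M, N = A.shape
    assert N % P == 0 and M <= K <= P
    B = np.zeros((P, N))
    for j in range(N):
        tau = [(j + i) % P for i in range(K - M)]   # forced zero rows of column j
        H1, H2 = H[tau, :M], H[tau, M:]
        z = -np.linalg.solve(H2, H1 @ A[:, j])      # appended entries (Lemma 2)
        B[:, j] = H @ np.concatenate([A[:, j], z])  # column of B = H f_j
    return B

def shortdot_decode(H, S, dot_products, M):
    """Recover A x from the K dot products b_i^T x, i in S (Lemma 1)."""
    return np.linalg.solve(H[S, :], dot_products)[:M]  # first M entries of F x

rng = np.random.default_rng(0)
M, N, P, K = 2, 12, 6, 4
A = rng.standard_normal((M, N))
H = rng.standard_normal((P, K))   # square submatrices invertible w.p. 1
B = shortdot_encode(A, P, K, H)

x = rng.standard_normal(N)
S = [0, 2, 3, 5]                  # any K surviving processors
y = shortdot_decode(H, S, B[S] @ x, M)
assert np.allclose(y, A @ x)      # linear transform recovered from K outputs
row_sparsity = (np.abs(B) > 1e-9).sum(axis=1).max()
assert row_sparsity <= N * (P - K + M) // P   # here: at most 8 nonzeros of 12
```

In this toy instance each processor stores only 8 of the 12 coordinates, yet any 4 of the 6 processors suffice to recover A x.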
Remark 1: Relaxed conditions on the matrix H
It was stated in Lemmas 1 and 2 that all square submatrices of H need to be invertible. A matrix with i.i.d. Gaussian entries can be shown to satisfy this property with probability 1. In fact, the condition on H in Lemmas 1 and 2 can be relaxed, as is evident from the proofs. For the matrix H we only need two conditions: (1) all K × K square submatrices of H are invertible; (2) all square submatrices in the last K − M columns of H are invertible. A Vandermonde matrix satisfies both these properties and thus can be used for encoding in Short-Dot.
With this insight in mind, we now formally state our computation strategy:
Table 1: Lengths of the dot products computed at each processor and the worst-case number of processors to wait for (K), for full-length strategies (top) and strategies with dot products shortened to length s (bottom).

Strategy | Length | Parameter K
Repetition (M divides P) | N | P − P/M + 1
MDS | N | M
Short-Dot | s | P + M − sP/N

Strategy | Length | Parameter K
Repetition with block partitioning | s | P − sP/(MN) + 1
Short-MDS | s | P − ⌊P/⌈N/s⌉⌋ + M
Remark 2: Short-MDS as a special case of Short-Dot
An extension of the MDS-codes-based strategy proposed in [kananspeeding], which we call Short-MDS, can be designed to achieve row sparsity s < N. First block-partition the matrix A of N columns into N/s submatrices of size M × s, and also divide the total P processors equally into N/s parts. Now, each submatrix can be encoded using an MDS code over its group of sP/N processors. In the worst case, including all integer effects, this strategy requires P − ⌊P/⌈N/s⌉⌋ + M processors to finish. In comparison, Short-Dot requires P + M − ⌊sP/N⌋ processors to finish. In the regime where s exactly divides N, Short-MDS can be viewed as a special case of Short-Dot, as both expressions match. However, in the regime where s does not exactly divide N, Short-MDS requires more processors to finish in the worst case than Short-Dot. Short-Dot is a generalized framework that can achieve a wider variety of pre-specified sparsity patterns, as required by the application. In Table 1, we compare the lengths of the dot products and the straggler resilience K, i.e., the number of processors to wait for in the worst case, for the different strategies.
3 Limits on the trade-off between the length of dot products and the parameter K
In this section, we derive fundamental trade-offs between the length of the dot products computed at each individual processor and the number of processors to wait for, i.e., K, which parameterizes the resilience to stragglers. First we derive an information-theoretic limit in Theorem 2 that holds for any matrix A such that each column has at least one nonzero entry. (The choice of such a class of matrices is reasonable: if, say, the j-th column of A consists entirely of zeros, then the j-th column of B and the corresponding entry of the unknown vector x can simply be omitted from the problem.) In Theorem 3, we show how this bound can be tightened further, so that in the limit of a large number of columns N of the matrix A, Short-Dot is near-optimal.
Theorem 2
Let A be any M × N matrix such that each column has at least one nonzero element. For any P × N matrix B satisfying the property that the span of any K of its rows contains the span of the rows of A, the average sparsity s over the rows of B must satisfy s ≥ N(P − K + 1)/P.
Proof: We claim that K is strictly greater than the maximum number of zeros that can occur in any column of the matrix B. If not, suppose the j-th column of B has K or more zeros. Then there exists a choice of K rows of B such that any linear combination of these rows will always be zero at the j-th column index. However, since the j-th column of A has at least one nonzero entry, say at row i, it is not possible to generate the i-th row of A by linearly combining these chosen rows of B. Thus,
(3) max over columns j of (number of zeros in the j-th column of B) ≤ K − 1
(4) average number of zeros over the columns of B ≤ K − 1
Here the last line follows since the maximum value is always at least the average. Note that if s is the average sparsity over the rows of B, then the average number of zeros over the columns of B can be written as P(N − s)/N. Thus, from (4),
(5) P(N − s)/N ≤ K − 1
A slight rearrangement establishes the lower bound s ≥ N(P − K + 1)/P in Theorem 2.
Recall that Short-Dot achieves at most P − K + M nonzero entries in each column, while this proof gives a hard lower bound of P − K + 1; the bound is tight for M = 1. The bound on the average row sparsity is also tight only for M = 1 (implicitly assuming P divides N). Now we tighten this bound further for M > 1.
3.1 Tighter Fundamental Bounds
Theorem 3
Let M ≤ K ≤ P. Then there exists an M × N matrix A, such that any P × N matrix B satisfying the property that any K rows of B can span all the rows of A must also satisfy the following property:
The average sparsity s over the rows of B is lower bounded as
(6) s ≥ N(P − K + M)/P − (M − 1)² C(P, K − M + 1)/P
Moreover, if N is sufficiently large, such that N ≥ (M − 1)² C(P, K − M + 1), then the average sparsity over the rows of B is lower bounded as
(7) s ≥ N(P − K + M − 1)/P
Note that the second term in the lower bound in (6) does not depend on N. Thus, if N is sufficiently large compared to (M − 1)² C(P, K − M + 1), the second term becomes negligible compared to the first term, and the first term is precisely the sparsity that Short-Dot achieves. Thus, from this lower bound, we can conclude that when N is large, Short-Dot is near-optimal.
Before proceeding with the proof, we give the basic intuition behind the proof technique. We divide the columns of B into two groups: one with at most K − M zeros, and the other with more than K − M zeros. Then we show that there exist matrices A such that the number of columns in the latter group, i.e., with more than K − M zeros, is bounded, and this in turn bounds the average sparsity. Now we formally prove the theorem.
Proof: Let us denote the number of columns of B with more than K − M zeros by W. We will show in Lemma 3 that W ≤ (M − 1) C(P, K − M + 1). Now, compute the average number of zeros over the columns of B. The columns of B can be divided into two groups: W columns with more than K − M zeros, and N − W columns with at most K − M zeros. Recall from (3) that if A is chosen such that every column has at least one nonzero entry, then the maximum number of zeros in any column of B is upper bounded by K − 1. Thus, the group of W columns can have at most K − 1 zeros each. Thus,
(8) average number of zeros over the columns of B ≤ [W(K − 1) + (N − W)(K − M)] / N
If s is the average sparsity of the rows of B, then the average number of zeros over the columns of B is P(N − s)/N. Thus,
(9) P(N − s)/N ≤ [W(K − 1) + (N − W)(K − M)] / N
After a slight rearrangement, the average sparsity of the rows of B can be bounded as:
(10) s ≥ N(P − K + M)/P − W(M − 1)/P ≥ N(P − K + M)/P − (M − 1)² C(P, K − M + 1)/P
Thus, the first part of the theorem, i.e., (6), is proved. Using the condition N ≥ (M − 1)² C(P, K − M + 1) in (6), we can also obtain (7):
(11) s ≥ N(P − K + M)/P − N/P = N(P − K + M − 1)/P
Thus, the theorem is proved.
Now it only remains to prove Lemma 3.
Lemma 3: Let M ≤ K ≤ P. Then there exists an M × N matrix A, such that any B satisfying the property that any K rows of B can span all the rows of A must also satisfy the following property: the number of columns of B with more than K − M zeros is upper bounded as W ≤ (M − 1) C(P, K − M + 1).
Proof: Assume, for contradiction, that W > (M − 1) C(P, K − M + 1). Now, a column with more than K − M zeros has at least K − M + 1 zeros. There can be at most C(P, K − M + 1) different patterns in which K − M + 1 zeros can occur in a column of length P, and every column with more than K − M zeros contains at least one of these patterns (possibly with more zeros). By a pigeonhole argument, at least one of these patterns of K − M + 1 zeros must occur in M columns or more. Let us consider the P × M submatrix of B consisting of M such columns of B having zeros in the same K − M + 1 locations, i.e., with a common sparsity pattern. Any K rows of this submatrix of B should generate all the rows of the corresponding M × M submatrix of the given A, consisting of the same columns of A as picked in this submatrix of B.
There always exists a fully dense matrix A such that every M × M submatrix of A is full rank, since A can be arbitrary. This submatrix of A is thus of rank M, so any K rows of the submatrix of B should generate M linearly independent rows of this submatrix of A. But the submatrix of B has K − M + 1 rows consisting of all zeros, so there is a choice of K rows such that all these zero rows are chosen, and we are left with at most K − (K − M + 1) = M − 1 nonzero rows to generate M linearly independent rows. This is a contradiction. Thus, we must have W ≤ (M − 1) C(P, K − M + 1).
4 Analysis of expected computation time for exponential tail models
We now provide a probabilistic analysis of the computation time required by Short-Dot and compare it with uncoded parallel processing, repetition, and the MDS-coding-based linear computation scheme, as shown in Fig. 5. We follow the shifted-exponential computation-time model described in [kananspeeding]. Although the shifted exponential distribution may only be a crude approximation of the delay of real systems, we use it since it is analytically tractable and allows for a fair comparison with the strategy proposed in [kananspeeding]. We assume that the time T required by a processor to compute a single dot product of length N is distributed as:
(12) Pr(T ≤ t) = 1 − exp(−μ(t/N − 1)) for t ≥ N, and 0 otherwise.
Here, μ is the “straggling parameter” that determines the unpredictable latency in computation time. Intuitively, the shifted exponential model states that for a task of size N, there is a minimum time offset proportional to N such that the probability of completing the task before that time is 0. The probability of task completion is highest around the time offset and then decays with an exponential tail. This nature of the model may be attributed to the fact that a processor is most likely to finish its task of size N at a time proportional to N, but unpredictable latency due to queueing and various other factors causes an exponential tail. For a dot product of length s, we simply replace N by s in (12), as suggested in [kananspeeding]. The analysis of expected computation time requires closed-form expressions for the K-th order statistic, which are simple for exponential tails. However, a more thorough empirical study is necessary to establish any chosen model for straggling in a particular environment.
The expected computation time for Short-Dot is the expected value of the K-th order statistic of the P i.i.d. shifted-exponential service times, which is given by:
(13) E[T_Short-Dot] ≈ s (1 + (1/μ) log(P/(P − K)))
Here, (13) uses the fact that the expected value of the K-th order statistic of P i.i.d. exponential random variables with parameter μ is approximately (1/μ) log(P/(P − K)) [kananspeeding]. The expected computation time on the RHS of (13), with s = N(P − K + M)/P, is minimized over the choice of K. This minimal expected time is Θ(N) when M is linear in P, and Θ((MN/P) log(P/M)) when M is sublinear in P.
A detailed analysis of the expected computation time for the competing strategies, i.e., the uncoded, repetition, and MDS coding strategies, is provided in the Appendix. Table 2 shows the order-sense expected computation time in the regimes where M is linear and sublinear in P.
Table 2: Order-sense expected computation times. (Refer to the Appendix for a more accurate analysis taking integer effects into account.)

Strategy | Expected Time | M linear in P | M sublinear in P
Only one processor | MN(1 + 1/μ) | Θ(MN) | Θ(MN)
Uncoded (M divides P) | (MN/P)(1 + (1/μ) log P) | Θ(N log P) | Θ((MN/P) log P)
Repetition (M divides P) | ≈ N(1 + (M/(μP)) log M) | Θ(N log P) | Θ(N)
MDS | N(1 + (1/μ) log(P/(P − M))) | Θ(N) | Θ(N)
Short-Dot | min over K of N((P − K + M)/P)(1 + (1/μ) log(P/(P − K))) | Θ(N) | Θ((MN/P) log(P/M))
Note that in the regime where M is linear in P, Short-Dot outperforms the uncoded strategy by a factor that diverges to infinity for large P. Similarly, in the regime where M is sublinear in P, Short-Dot outperforms the MDS coding strategy by a factor that diverges to infinity for large P. Thus, Short-Dot universally outperforms its competing strategies over the entire range of M.
We now explicitly provide a regime where the speedup from Short-Dot diverges to infinity for large P, in comparison to all three competing strategies: MDS coding, repetition, and uncoded.
Theorem 4
Suppose M scales as Θ(P/log P). Then, Short-Dot with K = P − M has an expected computation time (scaled by N) of Θ(log log P / log P), which decays to 0 as P → ∞. In contrast, the expected computation times (scaled by N) of the MDS coding, repetition, and uncoded strategies scale as Θ(1), and thus do not decay to 0 as P → ∞.
Proof: For the proof of this theorem, we simply substitute the values of M and K into the expressions for the expected computation time, as follows. We let M = Θ(P/log P) for all the strategies. For the uncoded strategy, we thus obtain,
(14) E[T_uncoded]/N = (M/P)(1 + (1/μ) log P) = Θ(1)
For repetition, we obtain,
(15) E[T_rep]/N ≈ 1 + (M/(μP)) log M = Θ(1)
For MDS-coding-based linear computation, we obtain,
(16) E[T_MDS]/N = 1 + (1/μ) log(P/(P − M)) = Θ(1)
Now, we consider the Short-Dot strategy with K = P − M, so that s = 2MN/P. Note that the inequality M ≤ K is satisfied for P ≥ 2M, which holds in this regime for large P. Now let us calculate the expected computation time for Short-Dot.
(17) E[T_Short-Dot]/N = (2M/P)(1 + (1/μ) log(P/M)) = Θ(log log P / log P) → 0 as P → ∞
Thus, the speedup offered by Short-Dot in this regime is Θ(log P / log log P), and thus diverges to infinity for large P, as illustrated in Fig. 6.
5 Encoding and Decoding Complexity
5.1 Encoding Complexity:
Even though encoding is a preprocessing step (since A is assumed to be given in advance), we include a complexity analysis for the sake of completeness. Recall from Section 2 that we first choose an appropriate matrix H of dimension P × K, such that every K × K square submatrix is invertible and all square submatrices in the last K − M columns are invertible. Then, for each of the N columns of the given matrix A, we perform the following.
For each of the N columns, the encoding requires a matrix inversion of size (K − M) × (K − M) to solve a linear system of K − M equations, a matrix-vector product of size (K − M) × M, and another matrix-vector product of size P × K.
The naive encoding complexity is therefore O(N((K − M)³ + PK)). Note that there are effectively only P distinct column sparsity patterns for the cyclic design discussed in this paper. Thus, there are effectively only P unique submatrices H2_{τ_j}, so P unique matrix inversions suffice for all N columns, as the sparsity pattern is repeated. Thus, the complexity can be reduced to O(P(K − M)³ + NPK).
This is higher than MDS-coding-based linear computation, which has an encoding complexity of O(NPM), but it is only a one-time cost that provides savings in the online steps (as discussed earlier in this section).
5.1.1 Reduced Complexity using Vandermonde matrices:
The encoding complexity can be reduced further for special choices of the matrix H. Let us choose H to be a Vandermonde matrix, as given by
(18) H = [ 1 x_1 x_1² … x_1^{K−1} ; 1 x_2 x_2² … x_2^{K−1} ; … ; 1 x_P x_P² … x_P^{K−1} ]
Here, x_1, x_2, …, x_P are all distinct. This matrix satisfies all the requirements of the encoding matrix: all K × K submatrices of H are invertible, and all square submatrices in the last K − M columns are also invertible. Thus, this matrix can be used to encode the matrix A. For each of the N columns of A, the encoding requires solving a linear system of K − M equations for z_j, as given by:
(19) H_{τ_j} f_j = H1_{τ_j} a_j + H2_{τ_j} z_j = 0
Here τ_j denotes the set of K − M indices at which the j-th column of B is constrained to be zero.
The matrixvector product is equivalent to the evaluation of a polynomial of degree with the coefficients as at arbitrary points given by . Once this product is obtained, the linear system of equations reduces to the interpolation of the unknown coefficients of a polynomial of degree (which is ), from its value at arbitrary points as given by . Once is obtained, we perform the following operation.
(20) 
This step is equivalent to the evaluation of a polynomial at the remaining points. Thus we decompose the encoding problem for each column into a collection of polynomial evaluation and interpolation problems. Now, from [kung1973fast], [li2000arithmetic], we know that both the interpolation and the evaluation of a polynomial of degree less than n at n arbitrary points can be performed in O(n log^2 n) operations. Thus, the complexity of encoding is reduced accordingly.
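The equivalence used above can be sketched numerically: computing a Vandermonde matrix-vector product is multipoint polynomial evaluation, and solving the Vandermonde system is interpolation. The sizes and node values below are illustrative; the naive solve is O(P^3), whereas the fast algorithms cited above achieve O(P log^2 P).

```python
import numpy as np

# Sketch: a Vandermonde matrix-vector product is multipoint polynomial
# evaluation, and solving the Vandermonde system is interpolation.
# P and the nodes b_i are illustrative choices, not values from the paper.
P = 8
b = np.linspace(1.0, 2.0, P)             # distinct evaluation points
B = np.vander(b, N=P, increasing=True)   # row i is [1, b_i, b_i^2, ...]

rng = np.random.default_rng(0)
z = rng.standard_normal(P)               # polynomial coefficients

y = B @ z                                # evaluate sum_k z_k * b_i^k at each b_i
z_hat = np.linalg.solve(B, y)            # interpolation recovers the coefficients
assert np.allclose(z, z_hat, atol=1e-6)
```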
5.2 Decoding Complexity:
During decoding, we obtain the dot products from the first processors to finish. We then perform the following operations.
We solve a system of linear equations and use only a subset of the entries of the obtained solution vector. Thus, effectively we perform a single matrix inversion followed by a matrix-vector product. The decoding complexity of ShortDot therefore does not depend on the dot-product length when the length is large, and is nearly the same as the decoding complexity of MDS-coding-based linear computation.
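A simplified numerical illustration of this decoding step, using a dense random encoder in place of ShortDot's actual sparse construction (all names and sizes below are illustrative):

```python
import numpy as np

# Simplified illustration of decoding: recover the desired dot products
# from the outputs of the first finishers by solving a small linear
# system. (ShortDot's actual encoder additionally enforces row sparsity;
# here G is a dense random encoder for illustration only.)
rng = np.random.default_rng(1)
m, N, P = 3, 100, 6

A = rng.standard_normal((m, N))   # given matrix
x = rng.standard_normal(N)        # input vector
G = rng.standard_normal((P, m))   # random Gaussian encoder (generic)
F = G @ A                         # one coded row per processor

done = [4, 0, 2]                  # indices of the first m processors to finish
outputs = F[done] @ x             # their dot-product results

# Decode: solve G[done] u = outputs for u = A x
u = np.linalg.solve(G[done], outputs)
assert np.allclose(u, A @ x)
```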
5.2.1 Reduced Complexity using Vandermonde matrices:
Similar to encoding, using Vandermonde matrices can reduce the decoding complexity further. As already discussed, we choose the encoding matrix to be a Vandermonde matrix as described in (18). The decoding problem consists of solving a system of linear equations:
(21) 
Here, the subscript denotes the set of indices of the processors that finish first. The problem is equivalent to the interpolation of the coefficients of a polynomial from its values at arbitrary points. Again, from [kung1973fast], [li2000arithmetic], the interpolation of a polynomial of degree less than n at n arbitrary points can be done in O(n log^2 n) operations, which thus becomes the decoding complexity.
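A minimal sketch of decoding-as-interpolation: with a Vandermonde encoder, the outputs of any K finished processors determine the K unknown coefficients. The sizes, nodes, and finisher indices below are illustrative.

```python
import numpy as np

# Sketch: with a Vandermonde encoder, decoding from any K finished
# processors is polynomial interpolation from K of the P evaluation points.
P, K = 7, 4
b = np.linspace(1.0, 2.0, P)              # distinct evaluation points
B = np.vander(b, N=K, increasing=True)    # P x K Vandermonde

rng = np.random.default_rng(2)
coeffs = rng.standard_normal(K)           # unknowns to recover
evals = B @ coeffs                        # what the P processors would return

done = [6, 1, 3, 0]                       # any K distinct finishers suffice
recovered = np.linalg.solve(B[done], evals[done])
assert np.allclose(recovered, coeffs)
```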
6 Experimental Results
We perform experiments on computing clusters at CMU to test the computation time. We use HTCondor [HTCondor] to schedule jobs simultaneously among the processors. We compare the time required to classify handwritten digits from the MNIST [lecun1998mnist] database, assuming we are given a trained neural network. We separately trained the neural network on the training samples to form a matrix of weights. For testing, we consider the multiplication of this given weight matrix with the test data matrix. All strategies used the same total number of processors.
Assuming that the weight matrix is encoded in a preprocessing step, we store the appropriate encoded rows in each processor a priori. Portions of the test data matrix are then sent to each of the parallel processors as input. We also send a C program, submitted using the condor_submit command, that computes the short dot products with the prestored rows. Each processor outputs the value of one dot product. The computation time reported in Fig. 7 includes the total time required to communicate inputs to each processor, compute the dot products in parallel, fetch the required outputs, decode, and classify all the test images, averaged over the experimental runs.
Table 3: Computation times for the three strategies.

Strategy   Parameter   Mean      STDEV    Minimum Time   Maximum Time
Uncoded    20          11.8653   2.8427    9.5192        27.0818
ShortDot   18          10.4306   0.9253    8.2145        11.8340
MDS        10          15.3411   0.8987   13.8232        17.5416
Key Observations (see Table 3 for detailed results): Computation time varies with the nature of the straggling at the particular instant of the experimental run. ShortDot outperforms both MDS and Uncoded in mean computation time. Uncoded is faster than MDS since the per-processor computation time for MDS is larger, which increases the straggling, even though MDS waits for only a subset of the processors. However, note that Uncoded has more variability than both MDS and ShortDot, and its maximum time observed during the experiment is much greater than that of both MDS and ShortDot. The classification accuracy on test data is the same for all strategies, since the decoding is exact.
Comment:
The experimental times are quite high due to some limitations of the experimental platform. The reported time includes overhead to start the cluster, communicate data in the form of text files to all the processors, and collect the output data files back from all the processors. The read time also depends on the size of the file to be read. We are currently looking at performing these experiments on alternative distributed computing platforms with better communication protocols.
7 Discussion
7.1 Storage and Communication benefits of Shorter Dot Products:
The major advantage of using ShortDot codes over the MDS coding strategy in [kananspeeding] is that both the prestored vectors (the encoded rows) and the communicated input (portions of the unknown vector) are shorter than the full vector length. ShortDot is thus applicable when processing units have memory limitations and it is not possible to prestore the full-length vectors. ShortDot also has advantages over [kananspeeding] in systems where the principal bottleneck in computation time is communicating the input to all the processors, and it may not be feasible to broadcast (multicast) to all processors at the same time. Thus, it is also useful in applications where communication costs dominate computation costs.
7.2 Errors instead of erasures:
While we focus on the problem of erasures in this paper, ShortDot can also be used to correct errors. Consider the scenario where, instead of straggling or failing, some processors in the distributed system return entirely faulty or garbage outputs, and we do not know which outputs are erroneous. We argue from coding-theoretic arguments that ShortDot codes designed to tolerate a given number of stragglers can also correct errors. First observe that if the code can tolerate a given number of stragglers (erasures), then the Hamming distance between any two codewords must exceed that number; hence roughly half as many errors can be corrected. The same result can also be derived by recasting the decoding problem as a sparse reconstruction problem and borrowing ideas from the standard compressive sensing literature [candes2005decoding], which also yields a concrete decoding algorithm. The problem reduces to an ℓ0-minimization problem, which can be relaxed into an ℓ1-minimization, or solved using alternate sparse reconstruction techniques, under certain conditions on the encoding matrix.
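A minimal sketch of the error-correction idea, using brute-force search over candidate error supports instead of the ℓ1 relaxation; the dense random encoder, sizes, and corrupted indices are illustrative, not the paper's construction.

```python
from itertools import combinations
import numpy as np

# Sketch: brute-force error correction for coded dot products. A code
# that tolerates 2e erasures can correct e errors: for each candidate
# set of e "faulty" processors, decode from the rest and keep the
# solution consistent with all remaining outputs.
rng = np.random.default_rng(3)
m, P, e = 3, 9, 2
G = rng.standard_normal((P, m))      # generic (random Gaussian) encoder
u_true = rng.standard_normal(m)
y = G @ u_true
y[[1, 6]] = 10.0                      # two processors return garbage

for bad in combinations(range(P), e):
    keep = [i for i in range(P) if i not in bad]
    u, *_ = np.linalg.lstsq(G[keep], y[keep], rcond=None)
    if np.allclose(G[keep] @ u, y[keep]):   # consistent with all kept outputs
        break

assert np.allclose(u, u_true)
```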
7.3 More dotproducts than processors
While we have presented the case where the number of dot products is at most the number of processors, ShortDot easily generalizes to the case where it is larger. The given matrix can be divided horizontally into several chunks along the row dimension (shorter matrices), and ShortDot can be applied to each of these chunks one after another. Moreover, if rows with the same sparsity pattern are grouped together and stored in the same processor initially, then the communication cost during the online computations is also significantly reduced, since only some elements of the unknown vector need to be sent to a particular processor.
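A sketch of this chunking idea, using a toy dense MDS-style coded multiply per chunk for illustration; the function and all parameter values are hypothetical, not the paper's construction.

```python
import numpy as np

# Sketch: when the matrix has more rows than a single coded round can
# handle, split it into row chunks and run the coded scheme chunk by chunk.
def coded_matvec(A_chunk, x, P, rng):
    """Toy dense coded multiply of one chunk (illustration only)."""
    m = A_chunk.shape[0]
    G = rng.standard_normal((P, m))      # random encoder for this chunk
    F = G @ A_chunk                      # one coded row per processor
    done = rng.permutation(P)[:m]        # first m processors to finish
    return np.linalg.solve(G[done], F[done] @ x)

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 50))        # 10 rows, processed 5 at a time
x = rng.standard_normal(50)
P = 8
result = np.concatenate([coded_matvec(A[i:i + 5], x, P, rng) for i in (0, 5)])
assert np.allclose(result, A @ x)
```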
Acknowledgments: This work was supported in part by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA. We also acknowledge NSF Awards 1350314, 1464336 and 1553248. S. Dutta also received the Prabhu and Poonam Goel Graduate Fellowship.
References
8 Appendix
We now provide a probabilistic analysis of the computation time required by ShortDot and compare it with uncoded parallel processing, repetition, and MDS-code-based linear computation, as shown in Fig. 5. We assume that the time required by a processor to compute a single dot product follows a (shifted) exponential distribution and is independent of the other parallel processors.
Let us assume that the time T required to compute a single dot product of full length N follows the distribution:
Pr(T ≤ t) = 1 − e^{−μ(t/N − 1)},   t ≥ N.    (22)
Here, μ is a straggling parameter that determines the “unpredictable latency” in computation time. We also assume that if the length of the dot product is s, where s is the sparsity of the vector, the probability distribution of the computation time varies as:
Pr(T ≤ t) = 1 − e^{−μ(t/s − 1)},   t ≥ s.    (23)
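The effect of this model can be illustrated by Monte Carlo simulation. The sketch below compares waiting for all P processors (each computing a length-N/P piece, uncoded) against waiting for K out of P (each computing a longer length-N/K piece, a toy coded scheme that ignores stragglers); all parameter values are illustrative.

```python
import numpy as np

# Straggler model of (22)-(23): the time for a length-s dot product is
# T = s * (1 + X) with X ~ Exp(mu), i.e. Pr(T <= t) = 1 - e^{-mu(t/s - 1)}
# for t >= s. P, K, mu, N below are illustrative, not values from the paper.
rng = np.random.default_rng(5)
P, K, mu, N = 20, 10, 1.0, 10_000
trials = 20_000

def mean_finish_time(s, wait_for):
    """Average time until `wait_for` of the P processors finish length-s tasks."""
    T = s * (1.0 + rng.exponential(1.0 / mu, size=(trials, P)))
    return np.sort(T, axis=1)[:, wait_for - 1].mean()

t_uncoded = mean_finish_time(N / P, wait_for=P)  # split the work, wait for all
t_coded = mean_finish_time(N / K, wait_for=K)    # add redundancy, ignore stragglers
# Despite longer per-processor tasks, the coded scheme finishes sooner on average.
```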
Now we derive the expected computation time using our proposed strategy and compare it with existing strategies in the regimes where the number of dot products is linear and sublinear in the number of processors.
Table 2 shows the order-sense expected computation time in the regimes where the number of dot products is linear and sublinear in the number of processors.