“Short-Dot”: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products


Sanghamitra Dutta, Carnegie Mellon University, sanghamd@andrew.cmu.edu
Viveck Cadambe, Pennsylvania State University, viveck@engr.psu.edu
Pulkit Grover, Carnegie Mellon University, pgrover@andrew.cmu.edu
Abstract

Faced with saturation of Moore’s law and increasing size and dimension of data, system designers have increasingly resorted to parallel and distributed computing to reduce computation time of machine-learning algorithms. However, distributed computing is often bottlenecked by a small fraction of slow processors called “stragglers” that reduce the speed of computation because the fusion node has to wait for all processors to complete their processing. To combat the effect of stragglers, recent literature proposes introducing redundancy in computations across processors, e.g., using repetition-based strategies or erasure codes. The fusion node can exploit this redundancy by completing the computation using outputs from only a subset of the processors, ignoring the stragglers. In this paper, we propose a novel technique – that we call “Short-Dot” – to introduce redundant computations in a coding-theory inspired fashion, for computing linear transforms of long vectors. Instead of computing long dot products as required in the original linear transform, we construct a larger number of redundant and short dot products that can be computed faster and more efficiently at individual processors. Compared to existing schemes that introduce redundancy to tackle stragglers, Short-Dot reduces the cost of computation, storage and communication, since shorter portions are stored and computed at each processor, and shorter portions of the input are communicated to each processor. Further, only a subset of these short dot products are required at the fusion node to finish the computation successfully, thus enabling us to ignore stragglers. We demonstrate through probabilistic analysis as well as experiments on computing clusters that Short-Dot offers significant speed-up compared to existing techniques. We also derive trade-offs between the length of the dot-products and the resilience to stragglers (number of processors to wait for) for any such strategy, and compare them to the trade-off achieved by our strategy.

 

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1 Introduction

This work proposes a coding-theory inspired computation technique for speeding up the computation of linear transforms of high-dimensional data by distributing it across multiple processing units that compute shorter dot products. Our main focus is on addressing the “straggler effect,” i.e., the problem of delays caused by a few slow processors that bottleneck the entire computation. To address this problem, we provide techniques (building on [kananspeeding, gauristraggler, gauriefficient, gauri2014delay, huang2012codes]) that introduce redundancy in the computation by designing a novel error-correction mechanism that allows the size of individual dot products computed at each processor to be shorter than the length of the input. Shorter dot products offer advantages in computation, storage and communication in distributed linear transforms.

The problem of computing linear transforms of high-dimensional vectors is “the” critical step [dally2015] in several machine learning and signal processing applications. Dimensionality reduction techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and random projections require the computation of short-and-fat linear transforms on high-dimensional data. Linear transforms are the building blocks of solutions to various machine learning problems, e.g., regression and classification, and are also used in acquiring and pre-processing the data through Fourier transforms, wavelet transforms, filtering, etc. Fast and reliable computation of linear transforms is thus a necessity for low-latency inference [dally2015]. Due to the saturation of Moore’s law, increasing the speed of computing in a single processor is becoming difficult, forcing practitioners to adopt parallel processing to speed up computing for ever-increasing data dimensions and sizes.

Classical approaches of computing linear transforms across parallel processors, e.g., Block-Striped Decomposition [kumar1994introduction], Fox’s method [fox1987matrix, kumar1994introduction], and Cannon’s method [kumar1994introduction], rely on dividing the computational task equally among all available processors, without any redundant computation. (Strassen’s algorithm [strassen1969gaussian] and its generalizations offer a recursive approach to faster matrix multiplications over multiple processors, but they are often not preferred because of their high communication cost [ballard2014communication].) The fusion node collects the outputs from each processor to complete the computation and thus has to wait for all the processors to finish. In almost all distributed systems, a few slow or faulty processors – called “stragglers” [straggler_tail] – are observed to delay the entire computation. This unpredictable latency in distributed systems is attributed to factors such as network latency, shared resources, maintenance activities, and power limitations. In order to combat stragglers, cloud computing frameworks like Hadoop [hadoop] employ various straggler-detection techniques and usually reset the task allotted to stragglers. Forward error-correction techniques offer an alternative approach to deal with this “straggler effect” by introducing redundancy in the computational tasks across different processors. The fusion node then requires outputs from only a subset of all the processors to successfully finish. In this context, the use of erasure codes dates back to the ideas of algorithmic fault tolerance [ABFT1984, faultbook]. Recently, optimized repetition and Maximum Distance Separable (MDS) [ryan2009channel] codes have been explored [gauristraggler, gauriefficient, kananspeeding, mohammad2016] to speed up computations.

We consider the problem of computing A x, where A is a given M × N matrix and x is an N × 1 vector that is input to the computation. In contrast with [kananspeeding], which also uses codes to compute linear transforms in parallel, we allow the size of individual dot products computed at each processor to be smaller than N, the length of the input.

Why might one be interested in computing short dot products while performing an overall large linear transform?
One reason is straightforward: the computation time depends on the length of the dot-products computed. Processors are also inherently memory-limited, which limits the size of dot products that can be computed. In some distributed and cloud computing systems, the computation time is dominated by the time taken to communicate the input to the processors. In systems where multi-casting is not possible or is inefficient, it may be faster to communicate a subset of the coordinates of x to each processor. In such systems, we anticipate that communicating shorter vectors, each formed by these subsets of coordinates of x, is likely to result in substantial speedups over schemes that require the entire vector (in particular when multi-casting is difficult). Another interesting example comes from recent work on designing processing units that exclusively compute dot-products using analog components [analog_dot, ericpop]; these devices are prone to errors and increased delays in convergence when designed for larger dot products. In Sections 4 and 6, we show both theoretically (under model assumptions inspired from [kananspeeding] that admit simplified expected-time analysis while being a crude approximation of experimental observations) and experimentally that the speed-up using Short-Dot can be increased beyond that obtained using the strategy proposed in [kananspeeding], in straggler-prone environments.

To summarize, our main contributions are:

  1. To compute A x for a given M × N matrix A, we instead compute F x, where we construct a P × N matrix F (total no. of processors P > required no. of dot-products M) such that each N-length row of F has at most s non-zero elements. Because the locations of zeros in a row of F are known by design, this reduces the complexity of computing dot-products of rows of F with x. Here K parameterizes the resilience to stragglers: any K of the P dot products of rows of F with x are sufficient to recover A x, i.e., any K rows of F can be linearly combined to generate the rows of A.

  2. We provide fundamental limits on the trade-off between the length of the dot-products and the straggler resilience K (number of processors to wait for) for any such strategy in Section 3. This yields a lower bound on the length of the task allotted to each processor. Our limits show that Short-Dot is near-optimal.

  3. Assuming exponential tails of service-times at each server (as used in [kananspeeding]), we derive the expected computation time required by our strategy and compare it to uncoded parallel processing, the repetition strategy and MDS coding [ryan2009channel] based linear computation (see Fig. 2). We also explicitly show a regime where Short-Dot outperforms all its competing strategies in expected computation time by a factor that diverges to infinity for large P. In general, Short-Dot is found to be universally faster than all its competing strategies. When M is linear in P, Short-Dot offers a speed-up over uncoded parallel processing and repetition. When M is sub-linear in P, Short-Dot outperforms repetition and MDS coding based linear computation.

  4. We also provide experimental results showing that Short-Dot is faster than existing strategies.

In a concurrent work [gradientcoding], Tandon et al. consider a coded computation problem similar to ours for the special case where M, the number of N-length dot-products to be computed, is 1, i.e., the given matrix A is a single row vector. We note that for M = 1, the gain of using coded strategies over replication-based strategies is bounded. Our paper differs from [gradientcoding] in that we consider the more general case M ≥ 1, and observe that the gains over replication can be unbounded in this regime. For M > 1, the number of operations per processor using our strategy is lower than a direct application of [gradientcoding] for the same worst-case straggler resilience. To see this, note that a straightforward extension of the strategy proposed in [gradientcoding] that encodes each of the M rows of A separately would require M dot-products at each processor, while, using a “joint” encoding across rows, Short-Dot only requires a single dot-product of length s at each processor, while still requiring the same number of processors (any K out of P) to finish. Further, we also provide a tighter converse that proves that Short-Dot is near-optimal. It is worth noting that [gradientcoding] also introduces the notion of partial stragglers, which is outside the scope of our paper.

For the rest of the paper, we define the sparsity of a vector as the number of its nonzero elements. We also assume that N is quite large compared to P, so that it is reasonable to assume that P divides N.

1.1 Comparison with existing strategies:

Consider the problem of computing a single dot product of an N-length input vector x with a pre-specified N-length vector a. By an “uncoded” parallel processing strategy (which includes Block-Striped Decomposition [kumar1994introduction]), we mean a strategy that does not use redundancy to overcome delays caused by stragglers. One uncoded strategy is to partition the dot product into P smaller dot products, where P is the number of available processors. E.g., a can be divided into P parts – constructing P short vectors of sparsity N/P – with each short vector stored in a different processor (as shown in Fig. 1, left). Only the nonzero values of each short vector need to be stored, since the locations of the nonzero values are known a priori at every node. One might expect the computation time for each processor to reduce by a factor of P. However, now the fusion node has to wait for all the processors to finish their computation, and the stragglers can delay the entire computation. Can we construct vectors such that dot products of a subset of them with x are sufficient to compute the dot product of a with x? A simple coded strategy is repetition with block partitioning, i.e., partitioning the vector a of length N into shorter parts and repeating each part across multiple processors, so as to obtain P short vectors, as shown in Fig. 1 (right). For each part of the vector a, the fusion node only needs the output of one processor among all its repetitions. Instead of a single dot-product, if one requires the dot-products of x with M vectors a_1, ..., a_M, one can simply repeat the aforementioned strategy M times.
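To make the uncoded baseline and the repetition variant concrete, here is a minimal numpy sketch; the sizes (N = 12, P = 4), the replication factor r = 2, and the random seed are illustrative choices, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)
N, P = 12, 4                       # illustrative sizes; P divides N here
a = rng.standard_normal(N)         # pre-specified vector (one row of A)
x = rng.standard_normal(N)         # input vector

# Uncoded block partition: processor i holds only the coordinates of block i.
blocks = np.array_split(np.arange(N), P)
partials = [a[idx] @ x[idx] for idx in blocks]   # one short dot product per processor
assert np.isclose(sum(partials), a @ x)          # fusion node must wait for ALL P outputs

# Repetition with block partitioning: each block is replicated on r processors,
# so the fusion node only needs the fastest copy of every block (still one per block).
r = 2
replicated_blocks = [idx for idx in np.array_split(np.arange(N), P // r) for _ in range(r)]
assert len(replicated_blocks) == P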

Figure 1: A dot-product of length N is computed in parallel using P processors. (Left) Uncoded parallel processing: divide a into P parts. (Right) Repetition with block partitioning.

For multiple dot-products, an alternative repetition-based strategy is to compute each of the M dot products P/M times in parallel at different processors. Now we only have to wait for at least one processor corresponding to each of the M vectors to finish (see Fig. 2). Improving upon repetition, it is shown in [kananspeeding] that a (P, M) MDS code allows constructing P coded vectors such that any M of the P dot-products can be used to reconstruct all the M original dot-products (see Fig. 2). This strategy is shown, both experimentally and theoretically, to perform better than repetition and uncoded strategies.

Figure 2: Different strategies of parallel processing: here M dot-products of length N are computed using P processors. (a) Uncoded parallel processing; (b) Repetition strategy; (c) Using MDS codes; (d) Using Short-Dot.

Can we go beyond MDS codes? MDS code-based strategies require N-length dot-products to be computed on each processor. Short-Dot instead constructs P vectors of sparsity s (less than N), such that the dot product of x with any K out of these P short vectors is sufficient to compute the dot-products of x with all the M given vectors (see Fig. 2). Compared to MDS codes, Short-Dot is more flexible: it waits for a few more processors (since K ≥ M), but each processor computes a shorter dot product. Short-Dot also effectively reduces the communication cost, since only a shorter portion of the input vector has to be communicated to each processor. We also propose Short-MDS, an extension of the MDS code-based strategy in [kananspeeding] that creates short dot-products of length s through block partitioning, and compare it with Short-Dot. In regimes without integer effects, Short-MDS may be viewed as a special case of Short-Dot. Otherwise, Short-MDS has to wait for more processors in the worst case than Short-Dot for the same sparsity s, as discussed in Remark 2 in Section 2.
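To make this trade-off concrete, here is a small worked example. It assumes the row sparsity s = N(P − K + M)/P achieved by the construction of Section 2 (Theorem 1); the numbers N = 1000, P = 10, M = 2 and K = 8 are purely illustrative:

s = N(P − K + M)/P = 1000 × (10 − 8 + 2)/10 = 400.

Each processor thus stores and computes a dot-product of length 400 instead of 1000 (the length required by the MDS strategy), at the price of waiting for K = 8 rather than M = 2 processors. The two extremes are K = M, which forces full-length rows (s = N, as in MDS coding), and K = P, which gives the shortest possible rows (s = NM/P) but no straggler tolerance.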

2 Our coded parallelization strategy: Short-Dot

In this section, we provide our strategy for computing the linear transform A x, where x is the N-length input vector and A is a given M × N matrix. Short-Dot constructs a P × N matrix F such that predetermined linear combinations of any K rows of F are sufficient to generate each of the M rows of A, and any row of F has sparsity at most s. Each sparse row of F (say f_i) is sent to the i-th processor (i = 1, ..., P), and the dot-products of x with all the P sparse rows are computed in parallel. Let S_i denote the support (set of non-zero indices) of f_i. Thus, for any unknown vector x, short dot products of length at most s are computed on each processor. Since a linear combination of any K rows of F can generate the rows of A, the dot-products from the earliest K out of P processors can be linearly combined to obtain the linear transform A x. Before formally stating our algorithm, we first provide an insight into why such a matrix F exists in the following theorem, and develop an intuition on the construction strategy.

Figure 3: Short-Dot distributes short dot-products over P parallel processors, such that outputs from any K out of P processors are sufficient to compute A x successfully.
Theorem 1

Given M row vectors a_1, a_2, ..., a_M of length N, there exists a P × N matrix F such that a linear combination of any K rows of the matrix F is sufficient to generate each of the M row vectors, and each row of F has sparsity at most s = N(P − K + M)/P, provided P divides N.

Proof: We may append K − M rows b_{M+1}, ..., b_K to A to form a K × N matrix B. The precise choice of these additional vectors will be made explicit later. Next, we choose H, a P × K matrix such that every K × K square sub-matrix of H is invertible (this condition is relaxed in Remark 1). The following lemma shows that any K rows of the matrix F = H B are sufficient to generate any row of B, including a_1, ..., a_M:

Lemma 1

Let F = H B, where B is a K × N matrix and H is any P × K matrix such that every K × K square sub-matrix of H is invertible. Then, any K rows of F can be linearly combined to generate any row of B.

Proof: Choose an arbitrary index set S ⊆ {1, 2, ..., P} such that |S| = K. Let F_S be the K × N sub-matrix formed by the K chosen rows of F indexed by S. Then, F_S = H_S B, where H_S is the corresponding K × K sub-matrix of H, which is invertible by assumption. Thus, B = H_S^{-1} F_S, i.e., the i-th row of B is obtained by combining the chosen rows of F with the weights in the i-th row of H_S^{-1}, for i = 1, ..., K. Thus, each row of B is generated by the K chosen rows of F.
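A minimal numerical check of Lemma 1, assuming a Gaussian H (whose square sub-matrices are invertible with probability 1; see Remark 1); the sizes and seed are illustrative.

import numpy as np

rng = np.random.default_rng(1)
K, P, N = 4, 7, 10
B = rng.standard_normal((K, N))
H = rng.standard_normal((P, K))            # every K x K sub-matrix invertible w.p. 1
F = H @ B                                  # the P x N encoded matrix

S = rng.choice(P, size=K, replace=False)   # any K rows (e.g., the K fastest processors)
B_rec = np.linalg.solve(H[S], F[S])        # B = H_S^{-1} F_S
assert np.allclose(B_rec, B)               # every row of B is recovered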

In the next lemma, we show how the row sparsity of F can be constrained to be at most N(P − K + M)/P by appropriately choosing the appended vectors b_{M+1}, ..., b_K.

Lemma 2

Given an M × N matrix A, let B be the K × N matrix formed by appending K − M row vectors b_{M+1}, ..., b_K to A. Also let H be a P × K matrix such that every K × K square sub-matrix is invertible. Then there exists a choice of the appended vectors such that each row of F = H B has sparsity at most N(P − K + M)/P.

Proof: We select a sparsity pattern that we want to enforce on F and then show that there exists a choice of the appended vectors such that the pattern can be enforced.
Sparsity pattern enforced on F: This is illustrated in Fig. 4. First, we construct a P × P “unit block” with a cyclic structure of nonzero entries, where the K − M zeros in each row and column are arranged as shown in Fig. 4. Each row and column has at most P − K + M non-zero entries. This unit block is replicated horizontally N/P times to form a P × N matrix with at most P − K + M non-zero entries in each column, and at most N(P − K + M)/P non-zero entries in each row. We now show how the choice of the appended vectors can enforce this pattern on F.

     

Figure 4: Sparsity pattern of F: (Left) Unit block; (Right) Unit block concatenated N/P times to form a P × N matrix with row sparsity at most N(P − K + M)/P.

From F = H B, the j-th column of F can be written as F_j = H B_j, where B_j denotes the j-th column of B. By the chosen sparsity pattern, each column of F has at least K − M zeros, at locations indexed by a set Z_j. Let H_{Z_j} denote the (K − M) × K sub-matrix of H consisting of the rows of H indexed by Z_j. Thus, H_{Z_j} B_j = 0.

Divide B_j into two portions of lengths M and K − M as follows: B_j = [A_j ; u_j]. Here A_j is the j-th column of the given matrix A, and u_j depends on the choice of the appended vectors. Partitioning the columns of H_{Z_j} accordingly as H_{Z_j} = [H^{(1)}_{Z_j}  H^{(2)}_{Z_j}], we obtain

H^{(1)}_{Z_j} A_j + H^{(2)}_{Z_j} u_j = 0   (1)
⇒ u_j = −(H^{(2)}_{Z_j})^{-1} H^{(1)}_{Z_j} A_j,   (2)

where the last step uses the fact that H^{(2)}_{Z_j} is invertible because it is a (K − M) × (K − M) square sub-matrix of H formed within its last K − M columns. This explicitly provides the vector u_j, which completes the j-th column of B, and hence the j-th column of F. The other columns of B can be completed similarly, proving the lemma.

From Lemmas 1 and 2, for a given M × N matrix A, there always exists a P × N matrix F such that a linear combination of any K rows of F is sufficient to generate the given M row vectors and each row of F has sparsity at most N(P − K + M)/P. This proves the theorem.

Remark 1: Relaxed conditions on the matrix H

It has been stated in Lemmas 1 and 2 that all K × K square sub-matrices of H need to be invertible. A matrix with i.i.d. Gaussian entries can be shown to satisfy this property with probability 1. In fact, the condition on H in Lemmas 1 and 2 can be relaxed, as is evident from the proof. For the matrix H we only need two conditions: (1) all K × K square sub-matrices of H are invertible; (2) all square sub-matrices formed within the last K − M columns of H are invertible. A Vandermonde matrix satisfies both these properties and thus can be used for encoding in Short-Dot.
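A small sketch of one concrete choice of H along the lines of Remark 1: a real Vandermonde matrix built from distinct nodes. The node values, sizes, and the purely numerical spot-check below are illustrative assumptions, not the paper's prescription.

import numpy as np

P, K = 7, 4
nodes = 1.0 + np.arange(P) / P             # P distinct evaluation points (illustrative)
H = np.vander(nodes, K, increasing=True)   # H[i, k] = nodes[i] ** k, a P x K Vandermonde matrix

# Spot-check condition (1) of Remark 1 on a few random K x K sub-matrices.
rng = np.random.default_rng(2)
for _ in range(100):
    rows = rng.choice(P, size=K, replace=False)
    assert abs(np.linalg.det(H[rows])) > 1e-12   # K x K sub-matrices are invertible
# Condition (2), on sub-matrices within the last K - M columns, can be checked the same way.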

With this insight in mind, we now formally state our computation strategy:

[A] Pre-Processing Step: Encode (Performed Offline)
Given: the M × N matrix A; a P × K matrix H satisfying the conditions of Remark 1; the sparsity pattern of Fig. 4
1: For j = 1 to N do
2:      Set Z_j ← the K − M indices that are 0 in the j-th column of F
3:      (Z_j is the set of indices that are 0 for the j-th column of F, by the chosen sparsity pattern)
4:      Set H_{Z_j} ← rows of H indexed by Z_j, partitioned as [H^{(1)}_{Z_j}  H^{(2)}_{Z_j}]
5:      Solve H^{(2)}_{Z_j} u_j = −H^{(1)}_{Z_j} A_j for u_j
6:      Set B_j ← [A_j ; u_j] (j-th col of B)
7: Encoded Output: row representation f_1, ..., f_P of the matrix F = H B
8: For i = 1 to P do
9:      Store S_i ← indices of non-zero entries in the i-th row of F
10:     Send the non-zero entries of f_i to the i-th processor (i-th row of F sent to i-th processor)
[B] Online computations
External Input: x
Resources: P parallel processors
[B1] Parallelization Strategy: Divide task among P parallel processors:
1: For i = 1 to P do
2:      Send x_{S_i} to the i-th processor
3:      Compute at the i-th processor: the dot product of the stored non-zero entries of f_i with x_{S_i} (x_{S_i} denotes only the entries of the vector x indexed by S_i)
Output: dot-products from the earliest K processors
Algorithm 1 Short-Dot
[B2] Fusion Node: Decode the dot-products A x from the processor outputs:
1: Set S ← the K processors that finished first
2: Set H_S ← K × K sub-matrix of H formed by the rows indexed by S
3: Set y_S ← column vector of outputs from the first K processors
4: Set u ← H_S^{-1} y_S
5: Output: A x = the first M entries of u
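A compact end-to-end sketch of Algorithm 1 in numpy, under the same assumptions as above (a Gaussian H, a cyclic sparsity pattern in the spirit of Fig. 4, and P dividing N). All sizes, the seed, and the particular cyclic pattern are illustrative; a real deployment would run the encoding offline and ship only the non-zero entries and supports to the processors.

import numpy as np

rng = np.random.default_rng(3)
M, K, P, N = 2, 4, 6, 12           # illustrative sizes; P divides N
A = rng.standard_normal((M, N))
H = rng.standard_normal((P, K))    # Gaussian H satisfies Remark 1 with probability 1

# --- [A] Encode (offline): build F column by column ---
F = np.zeros((P, N))
for j in range(N):
    Z = (j + np.arange(K - M)) % P              # cyclic choice of K - M zero locations
    H1, H2 = H[Z, :M], H[Z, M:]                 # split the rows of H indexed by Z
    u = np.linalg.solve(H2, -H1 @ A[:, j])      # appended entries forcing zeros at Z
    F[:, j] = H @ np.concatenate([A[:, j], u])  # j-th column of F = H B_j

row_sparsity = np.count_nonzero(np.round(F, 12), axis=1)
assert row_sparsity.max() <= N * (P - K + M) // P   # sparsity guaranteed by Theorem 1

# --- [B] Online: outputs of any K processors suffice ---
x = rng.standard_normal(N)
S = np.sort(rng.choice(P, size=K, replace=False))   # the K "fastest" processors
y = F[S] @ x                                        # their short dot products
u = np.linalg.solve(H[S], y)                        # u = B x
assert np.allclose(u[:M], A @ x)                    # the first M entries recover A x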
Strategy                            Length of dot-product    Parameter K
Repetition
MDS
Short-Dot
Repetition with block partition
Short-MDS
Table 1: Trade-off between the length of the dot-products and the parameter K for different strategies.

Remark 2: Short-MDS – a special case of Short-Dot

An extension of the MDS code-based strategy proposed in [kananspeeding], which we call Short-MDS, can be designed to achieve row-sparsity s. First, block-partition the matrix A of N columns into N/s sub-matrices of s columns each, and also divide the total P processors equally into N/s groups. Now, each sub-matrix can be encoded using an MDS code over its group of processors. In the worst case, including all integer effects, this strategy can require more processors to finish than Short-Dot does for the same sparsity. In the regime where these divisions are exact, Short-MDS can be viewed as a special case of Short-Dot, as both expressions for K match. However, when the divisions are not exact, Short-MDS requires more processors to finish in the worst case than Short-Dot. Short-Dot is a generalized framework that can achieve a wider variety of pre-specified sparsity patterns as required by the application. In Table 1, we compare the lengths of the dot-products and the straggler resilience K, i.e., the number of processors to wait for in the worst case, for different strategies.

3 Limits on trade-off between the length of dot-products and parameter K

In this section, we derive fundamental trade-offs between the length of the dot-products computed at each individual processor and the number of processors to wait for, i.e., K, which parametrizes the resilience to stragglers. First we derive an information-theoretic limit in Theorem 2 that holds for any M × N matrix A such that each column has at least one non-zero entry. (The choice of such a class of matrices is reasonable, since if, say, the j-th column of A consists entirely of zeros, then the j-th column of A and the corresponding entry of the unknown vector x can simply be omitted from the problem.) In Theorem 3, we show how this bound can be tightened further, so that in the limit of a large number of columns of the matrix A, Short-Dot is near-optimal.

Theorem 2

Let A be any M × N matrix such that each column has at least one non-zero element. For any P × N matrix F satisfying the property that the span of any K of its rows contains the span of the rows of A, the average sparsity s over the rows of F must satisfy s ≥ N(P − K + 1)/P.

Proof: We claim that K is strictly greater than the maximum number of zeros that can occur in any column of the matrix F. If not, suppose the j-th column of F has K or more zeros. Then there exists a choice of K rows of F such that any linear combination of these rows is always 0 at the j-th column index. However, since the j-th column of A has at least one non-zero entry, say at row i, it is not possible to generate the i-th row of A by linearly combining these chosen K rows of F. Thus,

Maximum number of zeros in any column of F ≤ K − 1   (3)
⇒ Average number of zeros over the columns of F ≤ K − 1.   (4)

Here the last line follows since the maximum value is always at least the average. Note that if s is the average sparsity over the rows of F, then the average number of zeros over the columns of F can be written as P(N − s)/N. Thus, from (4),

P(N − s)/N ≤ K − 1.   (5)

A slight re-arrangement establishes the lower bound s ≥ N(P − K + 1)/P in Theorem 2.
Recall that Short-Dot achieves a column sparsity of at most P − K + M, while the hard lower bound from this proof is P − K + 1 non-zero entries per column. The bound is tight for M = 1. The bound on average row-sparsity is also tight only for M = 1 (implicitly assuming P divides N). Now we tighten this bound further for M > 1.

3.1 Tighter Fundamental Bounds

Theorem 3

There exists an M × N matrix A such that any P × N matrix F, satisfying the property that any K of its rows can span all the rows of A, must also satisfy the following property:

The average sparsity s over the rows of F is lower bounded as

(6)

Moreover, if N is sufficiently large, then the average sparsity s over the rows of F is lower bounded as

(7)

Note that the second term in the lower bound in (6) does not depend on N. Thus, if N is sufficiently large compared to P and K, the second term in the lower bound becomes negligible compared to the first term, and the first term is precisely the row sparsity that Short-Dot achieves. Thus, from this lower bound, we can conclude that when N is large, Short-Dot is near-optimal.

Before proceeding with the proof, we give the basic intuition behind the proof technique. We divide the columns of F into two groups: one with at most K − M zeros, and the other with more than K − M zeros. Then we show that there exist matrices A such that the number of columns in the latter group, i.e., with more than K − M zeros, is bounded, and this in turn bounds the average sparsity. Now we formally prove the theorem.

Proof: Let us denote the number of columns of F with more than K − M zeros as W. We will show later in Lemma 3 that W is bounded by a quantity that does not depend on N. Now, compute the average number of zeros over the columns of F. The columns of F can be divided into two groups: columns with more than K − M zeros and columns with at most K − M zeros. Recall from (3) that if A is chosen such that every column has at least one non-zero entry, then the maximum number of zeros in any column of F is upper bounded by K − 1. Thus, the group of W columns can have at most K − 1 zeros each. Thus,

Average number of zeros over the columns of F ≤ [W(K − 1) + (N − W)(K − M)]/N.   (8)

If s is the average sparsity of each row of F, then the average number of zeros of each column of F is given by P(N − s)/N. Thus,

P(N − s)/N ≤ [W(K − 1) + (N − W)(K − M)]/N.   (9)

After slight re-arrangement, the average sparsity of each row of F can be bounded as:

s ≥ N(P − K + M)/P − W(M − 1)/P.   (10)

Substituting the bound on W from Lemma 3 proves the first part of the theorem, i.e., (6). Using the condition that N is sufficiently large in (6), we can also obtain (7). Thus,

(11)

Thus, the theorem is proved.

Now it only remains to prove Lemma 3.

Lemma 3: There exists an M × N matrix A such that any P × N matrix F, satisfying the property that any K of its rows can span all the rows of A, must also satisfy the following property: the number of columns of F with more than K − M zeros is upper bounded as W ≤ (M − 1) C(P, K − M + 1), where C(·, ·) denotes the binomial coefficient.

Proof: Assume, to the contrary, that W > (M − 1) C(P, K − M + 1). Now, a column with more than K − M zeros will have at least K − M + 1 zeros. There can be at most C(P, K − M + 1) different patterns in which K − M + 1 zeros can occur in a column of length P. Every column with more than K − M zeros contains one of these column sparsity patterns, possibly with additional zeros. From a pigeon-hole argument, at least one of these sparsity patterns of K − M + 1 zeros will surely occur in M columns or more. Let us consider the sub-matrix of F consisting of only the columns of F having zeros at the same K − M + 1 locations, i.e., with this common sparsity pattern. Any K rows of this sub-matrix of F should generate all the rows of the corresponding sub-matrix of the given A, consisting of the same columns of A as picked in this sub-matrix of F.

There always exists a fully dense matrix A such that any M × M sub-matrix of A is full-rank, since A can be arbitrary. The corresponding sub-matrix of A is therefore of rank M. Any K rows of the sub-matrix of F should generate the M linearly independent rows of this sub-matrix of A. But since the sub-matrix of F has K − M + 1 rows consisting of all zeros, there is a choice of K rows such that all these zero rows are chosen, and we are then left with at most M − 1 non-zero rows to generate M linearly independent rows. This is a contradiction. Thus, we must have W ≤ (M − 1) C(P, K − M + 1).

4 Analysis of expected computation time for exponential tail models

We now provide a probabilistic analysis of the computation time required by Short-Dot and compare it with uncoded parallel processing, repetition and MDS coding based linear computation, as shown in Fig. 5. We follow the shifted-exponential computation-time model described in [kananspeeding]. Although the shifted exponential distribution may only be a crude approximation of the delay of real systems, we use it since it is analytically tractable and allows for a fair comparison with the strategy proposed in [kananspeeding]. We assume that the time required by a processor to compute a single dot-product of length N is distributed as:

(12)

Here, μ is the “straggling parameter” that determines the unpredictable latency in computation time. Intuitively, the shifted exponential model states that for a task of size N, there is a minimum time offset proportional to N such that the probability of completing the task before that time is 0. The probability of task completion is maximum at the time offset and then decays with an exponential tail. This nature of the model might be attributed to the fact that, while a processor is most likely to finish its task of size N at a time proportional to N, unpredictable latency due to queuing and various other factors causes an exponential tail. For an s-length dot product, we simply replace N by s in (12), as suggested in [kananspeeding]. The analysis of expected computation time requires closed-form expressions for the K-th order statistic, which are simple for exponential tails. However, a more thorough empirical study is necessary to establish any chosen model for straggling in a particular environment.

The expected computation time for Short-Dot is the expected value of the K-th order statistic of the P i.i.d. shifted exponential completion times, which is given by:

(13)

Here, (13) uses the fact that the expected value of the k-th order statistic of P i.i.d. exponential random variables with parameter μ is (1/μ)(H_P − H_{P−k}) ≈ (1/μ) log(P/(P − k)), where H_n denotes the n-th harmonic number [kananspeeding]. The expected computation time on the RHS of (13) is minimized by an appropriate choice of K. The resulting minimal expected time is given in order sense in Table 2 for the regimes where M is linear and sub-linear in P.
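A small Monte Carlo sketch of this analysis. It assumes, for illustration, a shifted-exponential completion time whose shift and rate both scale with the per-processor task length (length × (1 + Exp(μ))), and it treats the uncoded per-processor task as MN/P operations; these modeling choices, the sizes, and μ are illustrative simplifications rather than the paper's exact expressions.

import numpy as np

rng = np.random.default_rng(4)
N, P, M, mu, trials = 10_000, 20, 4, 1.0, 20_000

def mean_kth_finish_time(length, k):
    # per-processor time: shift `length` plus an exponential with rate mu / length
    t = length * (1.0 + rng.exponential(1.0 / mu, size=(trials, P)))
    return np.sort(t, axis=1)[:, k - 1].mean()   # mean time until k of the P processors finish

K = 16                                    # Short-Dot waits for K of the P processors
s = N * (P - K + M) // P                  # Short-Dot row sparsity (Theorem 1)
print("uncoded  :", mean_kth_finish_time(M * N / P, P))   # short tasks, but wait for all P
print("MDS      :", mean_kth_finish_time(N, M))           # length-N rows, wait for only M
print("Short-Dot:", mean_kth_finish_time(s, K))           # length-s rows, wait for K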

Figure 5: Expected computation time: Short-Dot is faster than MDS in one regime and faster than uncoded in the complementary regime, and is universally faster over the entire range considered. For the chosen straggling parameter, repetition is slowest. When P is not exactly divisible by M, the distribution of computation time for repetition and uncoded strategies is the maximum of non-identical but independent random variables, which produces the ripples in these curves (see Appendix for details).

A detailed analysis of the expected computation time for the competing strategies, i.e., the uncoded strategy, repetition and the MDS coding strategy, is provided in the Appendix. Table 2 shows the order-sense expected computation time in the regimes where M is linear and sub-linear in P.

Strategy                     Expected time (M linear in P)    Expected time (M sub-linear in P)
Only one processor
Uncoded (M divides P)*
Repetition (M divides P)*
MDS
Short-Dot
  * Refer to Appendix for a more accurate analysis taking integer effects into account.

Table 2: Probabilistic computation times.

Note that in the regime where M is linear in P, Short-Dot outperforms the uncoded strategy by a factor that diverges to infinity for large P. Similarly, in the regime where M is sub-linear in P, Short-Dot outperforms the MDS coding strategy by a factor that diverges to infinity for large P. Thus Short-Dot universally outperforms all its competing strategies over the entire range of M.

Now we explicitly exhibit a regime where the speed-up from Short-Dot diverges to infinity for large P, in comparison to all three competing strategies – MDS coding, repetition and the uncoded strategy.

Theorem 4

Suppose M scales sub-linearly with P in an appropriate regime. Then, Short-Dot with a suitable choice of K has a (suitably scaled) expected computation time that decays to 0 as P → ∞. In contrast, the (similarly scaled) expected computation times for the MDS coding, repetition and uncoded strategies do not decay to 0 as P → ∞.

Proof: For the proof of this theorem, we simply substitute the chosen values of M and K into the expressions for the expected computation time, as follows. We use the same M for all the strategies. For the uncoded strategy, we thus obtain,

(14)

For repetition, we obtain,

(15)

For MDS Coding based linear computation, we obtain,

(16)

     

Figure 6: Log of expected computation time (suitably scaled) versus the number of processors P: Short-Dot offers speed-ups compared to uncoded, repetition and MDS coding that diverge for large P.

Now, we consider the Short-Dot strategy with the choice of K stated in the theorem. Let us calculate the expected computation time for Short-Dot.

(17)

Thus, the speed-up offered by Short-Dot in this regime diverges to infinity for large P, as illustrated in Fig. 6.

5 Encoding and Decoding Complexity

5.1 Encoding Complexity:

Even though encoding is a pre-processing step (since A is assumed to be given in advance), we include a complexity analysis for the sake of completeness. Recall from Section 2 that we first choose an appropriate matrix H of dimension P × K, such that every K × K square sub-matrix is invertible and all square sub-matrices in the last K − M columns are invertible. Now, for each of the N columns of the given matrix A, we perform the following.

Set Z_j ← the set of indices that are 0 for the j-th column of F
Set H_{Z_j} ← rows of H indexed by Z_j, partitioned as [H^{(1)}_{Z_j}  H^{(2)}_{Z_j}]
Solve H^{(2)}_{Z_j} u_j = −H^{(1)}_{Z_j} A_j for u_j
Set B_j ← [A_j ; u_j] (j-th col of B)
Set F_j ← H B_j (j-th col of F)

For each of the N columns, the encoding requires a matrix inversion of size (K − M) × (K − M) to solve the linear system of K − M equations, a matrix-vector product of size (K − M) × M (computing H^{(1)}_{Z_j} A_j), and another matrix-vector product of size P × K (computing the column F_j = H B_j).
The naive encoding complexity is therefore O(N((K − M)^3 + PK)). Note that there are effectively only P different column sparsity patterns for the particular design discussed in this paper. Thus, there are effectively only P unique index sets Z_j, and P unique matrix inversions suffice for all the N columns, as the sparsity pattern repeats. Thus, the complexity can be reduced to O(P(K − M)^3 + NPK).

This is higher than the encoding complexity of MDS coding based linear computation, but it is only a one-time cost that provides savings in the online steps (as discussed earlier in this section).

5.1.1 Reduced Complexity using Vandermonde matrices:

The encoding complexity can be reduced further for special choices of the matrix H. Let us choose H to be a Vandermonde matrix, as given by

H[i, k] = x_i^{k−1},  i = 1, …, P,  k = 1, …, K.   (18)

Here, the P values x_1, x_2, ..., x_P are all distinct. This matrix satisfies all the requirements on the encoding matrix: all K × K sub-matrices of H are invertible, and all square sub-matrices in the last K − M columns are also invertible. Thus, this matrix can be used to encode the matrix A. For each of the N columns of A, the encoding requires solving a linear system of K − M equations for u_j, as given by:

(19)

Here Z_j denotes the set of K − M indices that are zero in the j-th column of F.
The matrix-vector product H^{(1)}_{Z_j} A_j is equivalent to the evaluation of a polynomial of degree M − 1, with coefficients given by A_j, at the K − M points x_i for i ∈ Z_j. Once this product is obtained, the linear system of equations reduces to the interpolation of the unknown coefficients of a polynomial of degree K − M − 1 (which determines u_j) from its values at those K − M points. Once B_j is obtained, we perform the following operation.

(20)

This step is equivalent to the evaluation of a polynomial of degree K − 1, with coefficients B_j, at the P points x_1, …, x_P. Thus we decompose our encoding problem for each column of A into a collection of polynomial evaluation and interpolation problems, all of degree less than K. Now, from [kung1973fast], [li2000arithmetic], we know that both the interpolation and the evaluation of a polynomial of degree less than n at n arbitrary points can be done in O(n log^2 n) operations. Thus, the complexity of encoding reduces to O(NP log^2 P).
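A small numpy illustration of this reduction for a single column j, assuming the Vandermonde form of H in (18). It uses plain numpy polynomial routines rather than the fast O(n log^2 n) algorithms of [kung1973fast, li2000arithmetic], so it only demonstrates the equivalence, not the speed-up; all sizes, node values and the index set are illustrative.

import numpy as np
from numpy.polynomial import polynomial as Pl

rng = np.random.default_rng(5)
M, K, P = 2, 5, 8
nodes = 1.0 + np.arange(P) / P            # distinct Vandermonde nodes x_1, ..., x_P
H = np.vander(nodes, K, increasing=True)  # H[i, k] = x_i ** k

A_j = rng.standard_normal(M)              # j-th column of A = coefficients of a degree-(M-1) polynomial
Z = np.array([1, 3, 6])                   # K - M indices forced to zero in column j of F
xz = nodes[Z]

q_vals = Pl.polyval(xz, A_j)              # Step 1: evaluate the degree-(M-1) polynomial at the K - M points
u_vals = -q_vals / xz**M                  # values the degree-(K-M-1) polynomial u_j must take
u_j = Pl.polyfit(xz, u_vals, K - M - 1)   # Step 2: interpolate u_j from those K - M values
B_j = np.concatenate([A_j, u_j])
F_j = Pl.polyval(nodes, B_j)              # Step 3: evaluate the degree-(K-1) polynomial at all P nodes

assert np.allclose(F_j, H @ B_j)          # same column as the matrix product H B_j
assert np.allclose(F_j[Z], 0)             # the required zeros are enforced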

5.2 Decoding Complexity:

During decoding, we obtain the dot-products from the first K processors (out of P) to finish. We then perform the following operations.

Set S ← the K processors that finished first
Set H_S ← the K × K sub-matrix of H formed by the rows indexed by S
Set y_S ← column vector of outputs from the first K processors
Solve H_S u = y_S for u
Output: the first M values of u (= A x)

We solve a system of K linear equations in K variables and use only M values of the obtained solution vector. Thus, effectively we perform a single matrix inversion of size K × K followed by a matrix-vector product of size K × K. The decoding complexity of Short-Dot is thus O(K^3), which does not depend on N. This is nearly the same as the complexity of MDS coding based linear computation.

5.2.1 Reduced Complexity using Vandermonde matrices:

Similar to encoding, using Vandermonde matrices can reduce the decoding complexity further. As already discussed, we choose the encoding matrix H to be a Vandermonde matrix as described in (18). The decoding problem consists of solving a system of K linear equations in K variables.

(21)

Here S is the set of indices of the K processors that finished first. The problem of finding u is equivalent to the interpolation of the coefficients of a polynomial of degree K − 1 from its values at the K points x_i, i ∈ S. Again, from [kung1973fast], [li2000arithmetic], the interpolation of a polynomial of degree K − 1 from its values at K arbitrary points can be done in O(K log^2 K) operations, which thus becomes the decoding complexity.

6 Experimental Results

We perform experiments on computing clusters at CMU to measure the computation time. We use HTCondor [HTCondor] to schedule jobs simultaneously among the processors. We compare the time required to classify handwritten digits from the MNIST [lecun1998mnist] database, assuming we are given a trained neural network. We separately trained the neural network using the MNIST training samples to form a matrix of weights. For testing, the multiplication of this given weight matrix with the test data matrix is considered. The total number of processors was P = 20.

Assuming that the weight matrix is encoded into F in a pre-processing step, we store the rows of F in each processor a priori. Portions of the test data matrix are then sent to each of the parallel processors as input. We also send a C program that computes the short dot-products with the appropriate rows of F, using the command condor_submit. Each processor outputs the value of its dot-products. The computation time reported in Fig. 7 includes the total time required to communicate inputs to each processor, compute the dot-products in parallel, fetch the required outputs, decode, and classify all the test images, based on the experimental runs.

  

Figure 7: Experimental results: (Left) Mean computation time for the uncoded strategy, Short-Dot (K = 18) and MDS codes: Short-Dot is faster than MDS by about 32% and than uncoded by about 12%. (Right) Scatter plot of computation times for different experimental runs: Short-Dot is faster most of the time.
Strategy     Parameter    Mean      STDEV    Minimum Time    Maximum Time
Uncoded      20           11.8653   2.8427   9.5192          27.0818
Short-Dot    18           10.4306   0.9253   8.2145          11.8340
MDS          10           15.3411   0.8987   13.8232         17.5416
Table 3: Experimental computation time of dot products.

Key Observations (see Table 3 for detailed results): Computation time varies based on the nature of straggling at the particular instant of the experimental run. Short-Dot outperforms both MDS and uncoded strategies in mean computation time. Uncoded is faster than MDS since the per-processor computation time for MDS is larger, which increases the straggling, even though MDS waits for only 10 out of the 20 processors. However, note that uncoded has more variability than both MDS and Short-Dot, and its maximum time observed during the experiment is much greater than that of both MDS and Short-Dot. The classification accuracy on the test data is unaffected by the coding strategy, since the decoding is exact.

Comment:

The experimental times are quite high due to some limitations of the experimental platform used. The time includes overhead to start the cluster, communicate data in the form of text files to all the processors, and collect the output data files back from all the processors. The read time also depends on the size of the file to be read. We are currently looking at performing these experiments on alternative distributed computing platforms with better communication protocols.

7 Discussion

7.1 Storage and Communication benefits of Shorter Dot Products:

The major advantage of using Short-Dot over the MDS coding strategy in [kananspeeding] is that the length of the pre-stored vectors (rows of F), as well as of the communicated input (portions of x), is shorter than N. It is thus applicable when processing units have memory limitations and it is not possible to pre-store long vectors of length N. Short-Dot also has advantages over [kananspeeding] in systems where the principal bottleneck in computation time is communicating the input to all the processors, and it may not be feasible to broadcast (multicast) x to all processors at the same time. Thus, it is also useful in applications where communication costs are predominant over computation costs.

7.2 Errors instead of erasures:

While we focus on the problem of erasures in this paper, Short-Dot can also be used to correct errors. Consider the scenario where, instead of straggling or failures, some processors in a distributed system return entirely faulty or garbage outputs, and we do not know which of the outputs are erroneous. We argue from coding-theoretic arguments that Short-Dot codes designed to tolerate P − K stragglers can also correct errors. First observe that if the code can tolerate P − K stragglers, then the Hamming distance between any two codewords is at least P − K + 1. Hence, the number of errors that can be corrected is ⌊(P − K)/2⌋. The same result can also be derived by recasting the decoding problem as a sparse reconstruction problem and borrowing ideas from the standard compressive sensing literature [candes2005decoding], which also yields a concrete decoding algorithm. The problem reduces to an l0-minimization problem, which can be relaxed into an l1-minimization, or solved using alternative sparse reconstruction techniques, under certain constraints on the encoding matrix H.
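For concreteness, the standard coding-theoretic relation invoked above, with d denoting the minimum Hamming distance of the code {H v : v in R^K} formed by the possible processor-output patterns:

d ≥ (P − K) + 1   (any P − K erasures, i.e., stragglers, can be tolerated)
⇒ correctable errors t = ⌊(d − 1)/2⌋ ≥ ⌊(P − K)/2⌋.

For example, with P = 10 and K = 6 (illustrative numbers), up to 4 straggling processors can be ignored, or alternatively up to 2 garbage outputs can be corrected.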

7.3 More dot-products than processors

While we have presented the case M ≤ P here, Short-Dot easily generalizes to the case where M > P. The matrix A can be divided horizontally into several chunks along the row dimension (shorter matrices), and Short-Dot can be applied to each of these chunks one after another. Moreover, if rows with the same sparsity pattern are grouped together and stored in the same processor initially, then the communication cost is also significantly reduced during the online computations, since only some elements of the unknown vector x are sent to a particular processor.

Acknowledgments: This work was supported in part by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA. We also acknowledge NSF Awards 1350314, 1464336 and 1553248. S. Dutta also received the Prabhu and Poonam Goel Graduate Fellowship.

References

8 Appendix

We now provide a probabilistic analysis of the computation time required by Short-Dot and compare it with uncoded parallel processing, repetition and MDS code based linear computation, as shown in Fig. 5. We assume that the time required by a processor to compute a single dot-product follows a shifted exponential distribution and is independent of the other parallel processors.

Let us assume that the time required to compute a single dot-product of length N follows the distribution:

(22)

Here, μ is a straggling parameter that determines the “unpredictable latency” in computation time. We also assume that if the length of the dot-product is s, where s is the sparsity of the vector, the probability distribution of the computation time varies as:

(23)

Now we derive the expected computation time using our proposed strategy and compare it with existing strategies in the regimes where the number of dot-products M is linear and sub-linear in P.

Table 2 shows the order-sense expected computation time in the regimes where M is linear and sub-linear in P.