Gradient Coding via the Stochastic Block Model

Zachary Charles University of Wisconsin-Madison
Department of Electrical and Computer Engineering
Madison, WI 53706
Email: zcharles@wisc.edu
   Dimitris Papailiopoulos University of Wisconsin-Madison
Department of Electrical and Computer Engineering
Madison, WI 53706
Email: dimitris@ece.wisc.edu
Abstract

Gradient descent and its many variants, including mini-batch stochastic gradient descent, form the algorithmic foundation of modern large-scale machine learning. Due to the size and scale of modern data, gradient computations are often distributed across multiple compute nodes. Unfortunately, such distributed implementations can face significant delays caused by straggler nodes, i.e., nodes that are much slower than average. Gradient coding is a new technique for mitigating the effect of stragglers via algorithmic redundancy. While effective, previously proposed gradient codes can be computationally expensive to construct, inaccurate, or susceptible to adversarial stragglers. In this work, we present the stochastic block code (SBC), a gradient code based on the stochastic block model. We show that SBCs are efficient and accurate, and that under certain settings, adversarial straggler selection becomes as hard as detecting community structure in a stochastic block model with multiple communities.

I Introduction

In order to scale up machine learning to large data sets, we often use distributed implementations of traditional first-order optimization algorithms. While distributed algorithms can theoretically achieve substantial speedups, in practice the parallelization gains often fall short of the optimal speedup predicted by theory [1, 2]. One source of this phenomenon is the presence of stragglers: compute nodes whose runtime is much higher than that of most other nodes in the system.

There has been significant recent work on reducing the effect of stragglers in distributed algorithms. Current approaches include replicating jobs across nodes [3] and dropping stragglers in settings where the system can tolerate errors [4]. More recently, coding theory has gained traction as a way to speed up distributed computation. It has been used to reduce the runtime of the shuffling phase of MapReduce [5], make distributed matrix multiplication more efficient [6], and reduce the effect of stragglers in certain machine learning computations [7].

Gradient coding, a straggler-mitigation technique for distributed gradient-based methods, was first proposed in [8] and later extended in [9]. While gradient coding was initially used for exact reconstruction of the sum of gradients, it was later extended to approximate reconstruction using expander and Ramanujan graphs [9]. Such graphs can be expensive to compute in practice, especially for large numbers of compute nodes, and the approximation error they introduce during gradient recovery can be large.

Some efficient and simple approximate gradient codes were studied in [10], including fractional repetition codes (FRCs) and Bernoulli gradient codes (BGCs). FRCs were first presented in [8], but only for exact gradient coding. FRCs achieve very small approximation error when the stragglers are chosen at random, but suffer when the stragglers are selected adversarially. However, [10] shows that adversarial straggler selection is NP-hard in general, which suggests that there may exist gradient codes that are robust to polynomial-time adversaries, assuming widely believed computational conjectures.

The main question we tackle in this paper is whether it is possible to design gradient codes that 1) are efficiently computable, 2) achieve reconstruction error similar to FRCs under random stragglers, and 3) are resistant to adversarial stragglers.

I-A Our Contributions

In this work, we present the stochastic block code (SBC), a new gradient code that combines the small reconstruction error of FRCs under random straggler selection with the randomness of BGCs, which in some cases leads to robustness against adversarial stragglers. SBCs are based on the stochastic block model from random graph theory. The code is more effective than BGCs when stragglers are chosen randomly, but more difficult for an adversary to attack than FRCs. We give explicit bounds on the approximation error of SBCs under random straggler selection and show that certain adversarial attacks on SBCs are computationally as hard as recovery problems in community detection, in regimes where no polynomial-time algorithm is known to exist. Finally, we empirically evaluate SBCs and show that they are capable of achieving small approximation error, even when a constant fraction of the compute nodes are stragglers.

II Preliminaries

In this work, we use standard script for scalars and bold for vectors and matrices. Given a matrix A, we write A_ij for its (i, j) entry and denote its jth column by the corresponding bold lowercase letter. We write 1 for the all-ones vector or matrix of the appropriate dimensions, and we define the all-zeros vector and matrix analogously. Given a matrix A, we write A^+ for its pseudoinverse.

We consider a distributed master-worker setup of compute nodes, each of which is assigned tasks. A pictorial description is given in Figure 1. Each compute node locally computes a set of assigned functions and sends a linear combination of these functions to the master node.

Fig. 1: A master-worker distributed system where each compute node has multiple cores.

The goal of the master node is to compute the sum of functions

    f(x) = f_1(x) + f_2(x) + ... + f_k(x)    (1)

in a distributed way, where each function f_i can be assigned to and computed locally by any of the compute nodes. Each compute node returns a single output to the master.

Due to the straggler effect, we assume that the master only has access to the outputs of the non-straggler compute nodes. By not waiting for the output of straggler nodes, we can drastically improve the runtime of distributed algorithms. If we wish to exactly recover f, then a minimum number of non-straggler nodes is required [8]. However, in practice we may only need to approximately recover f, which we can do with significantly fewer non-straggler nodes.

The above setup is relevant to distributed learning algorithms, where we often wish to find a model by minimizing a sum of losses over the training data.

Here, the training samples are fixed, and the loss function measures the accuracy of the model with respect to each data point. To find a model that minimizes this sum of losses, we often use first-order, or gradient-based, methods. In these algorithms, at every distributed iteration we need to compute the gradient of the total loss over the training samples (or over a mini-batch of them). Letting each f_i in (1) be the gradient of the loss on the ith sample or data partition, we arrive at the setup in (1). Note that this kind of setup applies to both mini-batch SGD and full-batch gradient descent.
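To make this setup concrete, here is a minimal Python sketch of computing (1) as a sum of partial gradients over data partitions. The partition count k, the dimension d, and the least-squares loss are our own illustrative choices, not details from the paper.

```python
import numpy as np

# Hypothetical illustration of the setup in (1): the full gradient is a sum of
# k partial gradients f_i, each of which can be computed by any worker.
# The names (k, d, partial_gradient) are ours, not the paper's.

def partial_gradient(w, X_i, y_i):
    # Gradient of the squared loss 0.5 * ||X_i w - y_i||^2 on one data partition.
    return X_i.T @ (X_i @ w - y_i)

rng = np.random.default_rng(0)
k, d = 8, 5                                   # number of partitions, model dimension
X = [rng.normal(size=(10, d)) for _ in range(k)]
y = [rng.normal(size=10) for _ in range(k)]
w = np.zeros(d)

# The master's goal: the sum of the k partial gradients, as in (1).
full_gradient = sum(partial_gradient(w, X[i], y[i]) for i in range(k))
```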

II-A Gradient Codes

A gradient code consists of a choice of function assignments for each compute node, the messages sent from the compute nodes to the master, and a decoding method used by the master node to approximately reconstruct the true gradient. For simplicity, we assume that the master node can only compute linear combinations of the outputs of the non-straggler compute nodes.

Suppose we have k functions to compute and n compute nodes. We wish to construct a function assignment matrix that describes which functions each node computes and what message it passes back to the master node. The jth column corresponds to compute node j, and its entries are the coefficients of the linear combination of functions that node j sends back to the master node. In other words, if the jth column has support T, then node j computes f_i for each i in T and outputs their sum (or, more generally, a linear combination of the computed functions).

For decoding, we assume that we have a set of non-straggler nodes with known indices. Decoding the gradient code corresponds to designing a decoding vector supported on those indices, and the approximation to the gradient sum is the corresponding linear combination of the non-straggler outputs.

We define the error of our approximation in terms of this decoding vector. Since the error depends on the set of non-stragglers, which we do not know a priori, we define the approximation error of our gradient code as the smallest error achievable by any decoding vector supported only on the non-straggler nodes.
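For concreteness, this objective can be written in the following standard form. This is a sketch in our own notation, with A the function assignment matrix, v the decoding vector, S the non-straggler set, and 1 the all-ones vector; it follows the convention of [10] and may differ cosmetically from the authors' exact definition.

```latex
% Approximation error of a gradient code (our notation): the best achievable
% squared distance to the all-ones vector using columns indexed by S.
\[
  \mathrm{err}(\mathbf{A}, S)
    \;=\;
  \min_{\substack{\mathbf{v}\,\in\,\mathbb{R}^{n} \\ \operatorname{supp}(\mathbf{v})\,\subseteq\, S}}
  \left\lVert \mathbf{A}\mathbf{v} - \mathbf{1} \right\rVert_2^2 .
\]
```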

Let us call the indices of the non-straggler columns the non-straggler set. We define the non-straggler matrix as the matrix with the same entries as the function assignment matrix in the non-straggler columns and zeros in the straggler columns. The output available to the master depends only on this matrix, and the optimal decoding vector is defined as the minimizer of the approximation error over vectors supported on the non-straggler set.

Standard facts about the pseudoinverse of a matrix imply that one optimal decoding vector is obtained by applying the pseudoinverse of the non-straggler matrix to the all-ones vector. This is one of the possibly infinitely many optimal solutions to the above least squares recovery problem. In practice, computing this vector may not be feasible, since it involves the pseudoinverse of a large matrix. Moreover, it is difficult to analyze theoretically, since it involves the pseudoinverse of a random matrix. For these two reasons, we will often use and analyze simpler, more efficient decoding methods.
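The following sketch shows what optimal decoding via the pseudoinverse might look like. The matrix name A, the helper optimal_decode, and all parameter values are ours and purely illustrative.

```python
import numpy as np

# A minimal sketch of optimal decoding via the pseudoinverse (our own names).
# A_s is the "non-straggler matrix": the assignment matrix with straggler
# columns zeroed out.

def optimal_decode(A, non_stragglers):
    k, n = A.shape
    A_s = np.zeros_like(A, dtype=float)
    A_s[:, non_stragglers] = A[:, non_stragglers]
    # One minimizer of ||A_s v - 1||_2: apply the pseudoinverse to the all-ones
    # vector. Zero columns of A_s give zero entries of v, so v is supported on
    # the non-stragglers.
    return np.linalg.pinv(A_s) @ np.ones(k)

rng = np.random.default_rng(1)
A = (rng.random((12, 12)) < 0.3).astype(float)
v_opt = optimal_decode(A, non_stragglers=[0, 2, 5, 7, 9, 11])
err = np.linalg.norm(A @ v_opt - np.ones(12)) ** 2
```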

III Gradient Coding Methods

In this section, we briefly discuss two previously proposed gradient codes and their properties.

III-A Fractional Repetition Codes

Fractional repetition codes (FRCs) were first proposed in [8] for exact reconstruction of the gradient sum. They were later shown in [10] to be useful in the context of approximate gradient coding. Suppose we are given tasks, workers, and a desired number of tasks per worker; we assume without loss of generality that this number divides the number of tasks. The function assignment matrix of an FRC is block diagonal: the workers are partitioned into groups, all workers in a group are assigned the same set of tasks, and each diagonal block is an all-ones square block.

Note that if columns 1 and 2 are both non-stragglers, then using both columns does not help our decoding, since they are identical. To decode, we first look at the indices in the first block. If any of them are non-stragglers, we select the first such index and give it coefficient 1 in the decoding vector. We repeat this with the indices in the second block, and so on. As long as there is a non-straggler in each of these blocks, a simple calculation shows that the decoding recovers the sum exactly.
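A minimal sketch of an FRC and this decoding, assuming n tasks, n workers, and s tasks per worker with s dividing n; these assumptions and all names are ours.

```python
import numpy as np

# A sketch of a fractional repetition code (FRC): block-diagonal all-ones
# blocks, so all workers in a block compute the same s functions.

def frc_matrix(n, s):
    A = np.zeros((n, n))
    for i in range(n // s):
        A[i * s:(i + 1) * s, i * s:(i + 1) * s] = 1.0
    return A

def frc_decode(n, s, non_stragglers):
    # Pick the first non-straggler in each block (if any) with coefficient 1.
    v = np.zeros(n)
    S = set(non_stragglers)
    for i in range(n // s):
        block = [j for j in range(i * s, (i + 1) * s) if j in S]
        if block:
            v[block[0]] = 1.0
    return v

n, s = 12, 3
A = frc_matrix(n, s)
v = frc_decode(n, s, non_stragglers=[0, 1, 4, 7, 8, 10])
err = np.linalg.norm(A @ v - np.ones(n)) ** 2   # zero if each block has a survivor
```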

FRCs have small error with high probability if the stragglers are selected uniformly at random, even when a constant fraction of the nodes are stragglers [10]. However, the worst-case error of FRCs is quite large, as we discuss in Section IV-C. To remedy this, we use gradient codes that employ randomness.

III-B Bernoulli Gradient Codes

Bernoulli gradient codes (BGCs) were first introduced in [10]. We construct the function assignment matrix by drawing each entry as an independent Bernoulli random variable with a fixed probability, typically chosen small enough that each node computes only a small fraction of the functions. While we could use optimal decoding, there are more efficient (though slightly less accurate) decoding methods. Given the set of non-stragglers, somewhat small approximation error can be achieved by setting every non-straggler entry of the decoding vector to a common scaling constant and every straggler entry to 0. In [10], it was shown that under this decoding, the error is small with high probability. While less accurate than FRCs, BGCs do not seem to be as vulnerable to adversarial straggler selection.
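A sketch of a BGC with the simple scaled decoding described above; the specific scaling constant 1/(p·|S|) is our guess at a reasonable normalization, not necessarily the exact constant from [10].

```python
import numpy as np

# A sketch of a Bernoulli gradient code (BGC): every entry of the assignment
# matrix is an independent Bernoulli(p). The decoding gives every surviving
# worker the same weight (the constant is an assumption of ours).

rng = np.random.default_rng(2)
n, p = 100, 0.05
A = (rng.random((n, n)) < p).astype(float)

non_stragglers = rng.choice(n, size=60, replace=False)
v = np.zeros(n)
v[non_stragglers] = 1.0 / (p * len(non_stragglers))   # same weight on every survivor

err = np.linalg.norm(A @ v - np.ones(n)) ** 2
```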

IV Stochastic Block Codes

Our goal is to construct a gradient code that combines the small error of FRCs under random stragglers with the potential resilience of BGCs against polynomial-time adversaries. We propose a kind of interpolation between the two codes that we refer to as the stochastic block code (SBC). For such codes, we use a function assignment matrix with the block structure

(2)

Here, each block is a matrix with Bernoulli random entries: the entries of the diagonal blocks are Bernoulli with probability p, while the entries of the off-diagonal blocks are Bernoulli with probability q. In other words, the assignment matrix has a stochastic block model structure, with equal-size partitions, probability p within a community, and probability q between communities. In the following, we will assume p ≥ q.

Typically we will consider the case where p is close to 1 and q is small, so that the matrix has nearly all-ones blocks on the diagonal and only a small number of ones outside of the diagonal blocks. This model specializes to FRCs when p = 1 and q = 0, and to BGCs when p = q. By varying p and q, we can interpolate between FRCs and BGCs.
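A sketch of how the assignment matrix in (2) can be sampled; the block count, block size, and the values of p and q are illustrative choices of ours.

```python
import numpy as np

# A sketch of the stochastic block code assignment matrix in (2): workers are
# split into equal-size blocks, entries inside a diagonal block are
# Bernoulli(p), and entries outside are Bernoulli(q).

def sbc_matrix(n, num_blocks, p, q, rng):
    s = n // num_blocks                      # block size (assume it divides n)
    A = (rng.random((n, n)) < q).astype(float)
    for i in range(num_blocks):
        blk = slice(i * s, (i + 1) * s)
        A[blk, blk] = (rng.random((s, s)) < p).astype(float)
    return A

rng = np.random.default_rng(3)
A = sbc_matrix(n=100, num_blocks=20, p=0.95, q=0.01, rng=rng)
# p = 1, q = 0 recovers an FRC; p = q recovers a BGC.
```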

IV-A Decoding

We would like to devise a computationally efficient decoding method. Given the set of non-straggler indices, our decoding method works as follows. For each block of columns, we select one non-straggler index (if one exists) uniformly at random. We then add up the selected columns and scale the sum so that it is close to the all-ones vector. The idea is that if we select one column from each block, then their sum should be close to a multiple of the all-ones vector. A formal description is given in Algorithm 1.

Input : The indices of the non-straggler nodes.
Output : A vector that describes the decoding strategy.
 initialize the decoding vector to the all-zeros vector;
 if the expected value of each entry of the selected-column sum is at least 2 then
        set the scaling constant to this expected value;
 else
        set the scaling constant to 1;
 end if
 for each block of columns do
        if the block contains a non-straggler then
               pick one non-straggler index in the block uniformly at random;
               set the corresponding entry of the decoding vector to 1;
        end if
 end for
 return the decoding vector divided by the scaling constant;
Algorithm 1 Stochastic block decoding.

Recall that to use the decoding vector to compute an approximation to the gradient sum, we take the corresponding linear combination of the worker outputs. Since the decoding vector has non-zero entries only at indices corresponding to non-straggler nodes, this combination can be computed using only the non-straggler outputs.

This decoding corresponds to selecting one non-straggler column (if it exists) from each block of columns and then adding these columns together, so we select at most one column per block. We return the sum of the selected columns divided by the expected value of each of its entries. When p = 1 and q = 0, this method reduces to the optimal decoding method for FRCs discussed above.

When p is close to 1 and q is small, scaling by the expected entry value is not necessarily beneficial. In this regime, each entry of the column sum is 1 with high probability, while a few entries may be larger integers if any of the off-diagonal Bernoulli entries are non-zero. Scaling then introduces error into all of the entries that equal 1. Intuitively, if a random variable is 1 with high probability and 2 with small probability, then dividing it by its expected value yields a quantity that is close to, but never equal to, 1, while the unscaled variable equals 1 with high probability. Thus, not scaling yields a vector equal to the all-ones vector with high probability and has the added benefit of simplifying the analysis.

One could improve this algorithm by averaging over the non-straggler columns in each block, rather than picking just one, and then adding up the averages. When the number of non-stragglers per block is large, this approach performs better than stochastic block decoding, at the expense of being more difficult to analyze theoretically.
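A sketch of Algorithm 1 in Python; the scaling rule (divide by the expected entry value only when that value is at least 2) is our reading of the description above and should be treated as an assumption rather than the authors' precise constant.

```python
import numpy as np

# Stochastic block decoding sketch: pick one non-straggler column uniformly at
# random from each block, represent the sum via a 0/1 decoding vector, and
# rescale by the (approximate) expected entry value when it is at least 2.

def stochastic_block_decode(n, num_blocks, p, q, non_stragglers, rng):
    s = n // num_blocks
    S = set(int(j) for j in non_stragglers)
    v = np.zeros(n)
    selected = 0
    for i in range(num_blocks):
        block = [j for j in range(i * s, (i + 1) * s) if j in S]
        if block:
            v[rng.choice(block)] = 1.0
            selected += 1
    # Each entry of the summed columns is roughly one Bernoulli(p) term plus one
    # Bernoulli(q) term per column selected from the other blocks (assumption).
    c = p + max(selected - 1, 0) * q
    return v / c if c >= 2 else v

rng = np.random.default_rng(4)
n, b, p, q = 100, 20, 0.95, 0.01
survivors = rng.choice(n, size=60, replace=False)
v = stochastic_block_decode(n, b, p, q, survivors, rng)
```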

IV-B Random Stragglers

Suppose we use an SBC with stochastic block decoding and that the non-straggler indices are selected uniformly at random from all subsets of the nodes of a given size. We are particularly interested in the setting where the number of stragglers is a constant fraction of the total number of nodes, as this is when previous theory, such as that in [8], breaks down.

Because of the straggler effect, we only have access to the non-straggler matrix, in which zeros have been filled into all straggler columns. Note that the function assignment matrix is random, but even after fixing it, the non-straggler matrix remains random. While the worst-case error may still be large, as it is for FRCs, the average case is much better. We have the following theorem about the reconstruction error.

Theorem IV.1

Suppose that , and that for . If we apply stochastic block decoding to an SBC with randomly selected stragglers, then with probability at least , this produces a vector such that

All our proofs can be found in the long version of this paper [11].

Letting , we get the following corollary concerning SBCs that are close to FRCs.

Corollary IV.2

Suppose that and for . If the stragglers are selected randomly, then with probability at least , applying the stochastic block decoding method to such an SBC produces a vector such that

The above theorem required . When , we derive the following theorem.

Theorem IV.3

Suppose that and that . If the stragglers are selected randomly, then with probability at least , applying stochastic block decoding to an SBC produces a vector satisfying

By setting p = q, we can analyze the error of BGCs under stochastic block decoding.

Corollary IV.4

Suppose that and . If the stragglers are selected randomly, then with probability at least , applying the stochastic block decoding method to a BGC with probability produces a vector satisfying

IV-C Adversarial Stragglers

Suppose now that the non-stragglers are selected adversarially by some polynomial-time adversary. When p = 1 and q = 0 (i.e., for an FRC), an adversary trying to maximize the error would try to select entire blocks of workers with the same function assignments as stragglers. This strategy is efficient for an adversary to compute, even if the adversary only views a column-permuted version of the assignment matrix, and it leads to a large worst-case error [10].

The deterministic block structure of the assignment matrix when p = 1 and q = 0 is what allows an adversary to so easily find a worst-case set of stragglers. In general, the task of selecting stragglers to maximize the error is NP-hard [10]. When p < 1 and q > 0, an adversary could still achieve relatively high error by selecting stragglers from contiguous blocks in the block structure in (2). This task is equivalent to community detection in the stochastic block model.

Definition IV.5

We say that a community detection algorithm achieves a given accuracy if at most the corresponding fraction of nodes is misclassified, with probability tending to 1 as the number of nodes grows.

We say that a community detection algorithm achieves exact recovery if it misclassifies no nodes asymptotically, and that it achieves weak recovery if its accuracy is better than that of random guessing. In other words, weak recovery means that the algorithm performs better than randomly assigning community labels.

There has been significant work on understanding when exact and weak recovery are possible. For exact recovery, suppose that we have the stochastic block model in (2). Then exact recovery is possible exactly when the parameters lie above a known threshold [12]. This implies the following theorem.

Theorem IV.6

Suppose we use a stochastic block code with probabilities . Then there is no algorithm that can determine the block structure in (2) with accuracy as long as

Thresholds for weak recovery are also known, but not in as much generality. In [13], the authors define a signal-to-noise ratio (SNR) for a community detection problem in terms of the within- and between-community probabilities and the number of communities.

Suppose that we are in the sparse regime and have 2 communities. Then weak recovery is possible exactly when the SNR exceeds 1 [14]. For more than two communities, weak recovery is possible when the SNR exceeds 1; however, the converse does not necessarily hold [13]. In general, community detection appears to be more difficult for smaller values of the SNR. This leads to the following theorem.
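For reference, a standard way to write this signal-to-noise ratio in the sparse regime is sketched below, in our own notation: k equal-size communities, within-community probability a/n, and cross-community probability b/n, following the convention of [13].

```latex
% Signal-to-noise ratio of a symmetric stochastic block model with k
% communities, p = a/n within and q = b/n across communities (our notation):
\[
  \mathrm{SNR} \;=\; \frac{(a - b)^2}{k\,\bigl(a + (k - 1)\,b\bigr)} ,
\]
% so that, for k = 2, weak recovery is possible if and only if SNR > 1.
```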

Theorem IV.7

An adversary can determine the block structure in (2) with accuracy better than random if and only if weak recovery is possible. In particular, for two communities, this is possible if and only if the SNR exceeds 1.

We note that although there exist regimes in which identifying the community structure of an SBC can be computationally intractable, we have not identified a set of parameters for which worst-case straggler selection is NP-hard for SBCs while the average-case reconstruction error is small. We nevertheless believe that the connection to community detection is a very interesting direction that could lead to such results.

V Empirical Results

In this section, we compare the empirical error of SBCs for different parameter settings, using both the stochastic block decoding method and the optimal decoding method. We are interested in how the error changes as a function of the number of non-stragglers. Recall that the error of a decoding vector was defined in Section II-A. We compare this to the uncoded error, which equals the fraction of stragglers.
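A sketch of the kind of Monte Carlo experiment described here, using optimal (pseudoinverse) decoding; all parameter values are illustrative and not the ones used for the figures.

```python
import numpy as np

# Draw random stragglers, decode optimally, and average the squared error.

def sbc_matrix(n, num_blocks, p, q, rng):
    s = n // num_blocks
    A = (rng.random((n, n)) < q).astype(float)
    for i in range(num_blocks):
        blk = slice(i * s, (i + 1) * s)
        A[blk, blk] = (rng.random((s, s)) < p).astype(float)
    return A

def avg_error(n, num_blocks, p, q, num_non_stragglers, trials, rng):
    errs = []
    for _ in range(trials):
        A = sbc_matrix(n, num_blocks, p, q, rng)
        S = rng.choice(n, size=num_non_stragglers, replace=False)
        A_s = np.zeros_like(A)
        A_s[:, S] = A[:, S]
        v = np.linalg.pinv(A_s) @ np.ones(n)        # optimal decoding
        errs.append(np.linalg.norm(A @ v - np.ones(n)) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(5)
print(avg_error(n=60, num_blocks=12, p=0.9, q=0.02,
                num_non_stragglers=36, trials=50, rng=rng))
```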

In Figure 2, we plot where comes from stochastic block decoding. We use varying values of and set so that the expected number of non-zero entries equals . As our theory suggests, the error increases roughly proportionally to for reasonable . As approaches , has a smaller effect on the error.

In Figure 3, we plot for the same values of and . In this case, the error is less sensitive to than in stochastic block decoding. Moreover, even when , we achieve substantially smaller error than in the uncoded case.

Fig. 2: A plot of the average error over 5000 trials, for varying fractions of stragglers. The left and right panels correspond to two different parameter settings.
Fig. 3: A plot of the average error over 5000 trials, for varying fractions of stragglers. The left and right panels correspond to two different parameter settings.

VI Conclusion

In this work, we presented a new gradient code, the stochastic block code. The code is efficiently computable, and we provided an efficient decoding method for approximate gradient recovery. The code can be tuned according to whether we care more about random or adversarial stragglers, and it interpolates between previously proposed codes such as FRCs and BGCs. We gave theoretical bounds on the error of SBCs and showed empirically that they can achieve close to zero error in the presence of large numbers of random stragglers. We also showed that SBCs can be more robust to adversarial stragglers than FRCs.

Our main results consider a simplified decoding method. In some settings, optimal decoding may be computationally feasible. Understanding optimal decoding error could lead to improved gradient codes. Other open questions involve determining lower bounds on the error of gradient codes and finding optimal codes for a given column sparsity of the function assignment matrix.

References

  • [1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.
  • [2] H. Qi, E. Sparks, and A. Talwalkar, “Paleo: A performance model for deep neural networks,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=SyVVJ85lg
  • [3] N. B. Shah, K. Lee, and K. Ramchandran, “When do redundant requests reduce latency?” IEEE Transactions on Communications, vol. 64, no. 2, pp. 715–722, 2016.
  • [4] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, “Effective straggler mitigation: Attack of the clones,” in NSDI, vol. 13, 2013, pp. 185–198.
  • [5] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded MapReduce,” in 53rd Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2015, pp. 964–971.
  • [6] S. Dutta, V. Cadambe, and P. Grover, “Short-dot: Computing large linear transforms distributedly using coded short dot products,” in Advances in Neural Information Processing Systems, 2016, pp. 2100–2108.
  • [7] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” in IEEE International Symposium on Information Theory (ISIT), 2016, pp. 1143–1147.
  • [8] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding,” arXiv preprint arXiv:1612.03301, 2016.
  • [9] N. Raviv, I. Tamo, R. Tandon, and A. G. Dimakis, “Gradient coding from cyclic MDS codes and expander graphs,” arXiv preprint arXiv:1707.03858, 2017.
  • [10] Z. Charles, D. Papailiopoulos, and J. Ellenberg, “Approximate gradient coding via sparse random graphs,” arXiv preprint arXiv:1711.06771, 2017.
  • [11] Z. Charles and D. Papailiopoulos, “Gradient coding via the stochastic block model,” 2018. [Online]. Available: https://tinyurl.com/ydyyb252
  • [12] E. Mossel, J. Neeman, and A. Sly, “Reconstruction and estimation in the planted partition model,” Probability Theory and Related Fields, vol. 162, no. 3-4, pp. 431–461, 2015.
  • [13] E. Abbe, “Community detection and stochastic block models: recent developments,” arXiv preprint arXiv:1703.10146, 2017.
  • [14] E. Abbe and C. Sandon, “Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery,” in IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), 2015, pp. 670–688.
  • [15] D. S. Mitrinovic, J. Pecaric, and A. M. Fink, Classical and New Inequalities in Analysis. Springer Science & Business Media, 2013, vol. 61.
  • [16] E. J. Candes and Y. Plan, “A probabilistic and RIPless theory of compressed sensing,” IEEE Transactions on Information Theory, vol. 57, no. 11, pp. 7235–7254, 2011.

Appendix A Proof of Main Results

We wish to prove the results in Section IV-B. Recall that we have a function assignment matrix with the block structure given in (2), and that we only have the non-straggler columns to use in the stochastic block decoding method of Algorithm 1. We first wish to bound the probability that every block of columns contains at least one non-straggler. This will allow us to obtain better bounds on the decoding error, since achieving zero error in the FRC case is only possible when every block contains a non-straggler.

Lemma A.1
Proof:

Fix a block. The block contains no non-stragglers if and only if all non-stragglers are selected from outside that block. Therefore,

Taking a union bound over all blocks, we arrive at the desired result. ∎

This probability is unfortunately unwieldy to use. When is large enough, however, we can greatly simplify this probability, as in the following lemma.

Lemma A.2

Suppose that . Then

Proof:

By Lemma A.1, we have

Manipulating, we get

Therefore,

The right-hand side is at most if

(3)

Since , we have

(4)

Letting , (4) implies that (3) holds if

Since this occurs for all blocks, the desired result is shown. ∎

In the following, we will let denote the event that for all .

By Lemma A.2, with probability at least , each . Therefore, the vector has an entry of in locations, each corresponding to a different block of . Let denote these columns of , and let . Therefore,

This implies

(5)

We now consider two different regimes of . First, we will analyze the setting where for . We will show that in this setting, we can analyze the entries of directly. We first show that for this regime of , the entries of are not too large.

Lemma A.3

Suppose that and . Then

Proof:

Conditioning on the event that every block contains a non-straggler, we have the expression in (5). By the block structure in (2), each entry of the resulting vector is the sum of one Bernoulli(p) random variable and several Bernoulli(q) random variables, divided by the scaling constant. It therefore suffices to show that, with high probability, for each entry the Bernoulli(p) term equals 1 and the sum of the corresponding Bernoulli(q) terms is not too large. By the union bound and the assumption on p, the probability that some Bernoulli(p) term equals 0 is small.

We now bound the probability of having too many non-zero Bernoulli(q) terms. By the Chernoff bound, their sum exceeds the stated threshold with only small probability. This implies that the corresponding entry of the decoded vector is bounded with the same probability, and hence that the corresponding entry of the error is small with the same probability (conditioned on the event above). Taking a union bound over all entries, together with the bound on the Bernoulli(p) terms above, we get the desired result. ∎

We will now prove Theorem IV.1.

Proof:

Again, conditioning on , we have that as in (5). Note that since , we know that and so . By the block structure in (2), this implies that the th entry of is the sum of a random variable and random variables. We wish to bound the number of non-zero entries of .

Let denote the th entry of . Then a sufficient condition for is that the is 1 and all the are 0. We will bound the probability that this happens. For simplicity of notation, we will consider the case where there are random variables. Clearly, the is 0 with probability , and the sum of the is larger than 0 with probability . By Bernoulli’s inequality [15], this is at most . Therefore,

(6)

Therefore, the expected number of non-zero entries of is at most . Let . We will consider two cases, depending on .

Case 1: .

We now wish to bound the number of non-zero entries in with high probability. Since the entries are independent and any given entry is non-zero with probability at most , the Chernoff bound implies that for any , with probability at most . Taking , this implies that

The second inequality follows from the fact that . Therefore,

(7)

Case 2: .

Again, we wish to bound the number of non-zero entries in with high probability, again using the Chernoff bound. We now use the fact that for , with probability at most . Let . By assumption on , . Plugging this into the Chernoff bound implies

Note that by assumption on . Therefore,

(8)

Let . Combining Lemma A.3 with (7) and (8), we get

Since holds with probability at least , a straightforward conditional probability calculation shows

∎

Corollary IV.2 follows by setting and noting that when , the maximum in Theorem IV.1 equals 3.

Next, we consider the setting where . In order to prove Theorem IV.3, we will use a version of the vector Bernstein inequality to derive high probability bounds on . We will use a simplified version which can be found in [16].

Theorem A.4 (Vector Bernstein inequality)

Let be a finite sequence of independent random vectors. Suppose that a.s. and let . Then for all satisfying ,

We will now state and prove a more general version of Theorem IV.3. Recall that in stochastic block decoding, we scale by if this quantity is at least 2.

Theorem A.5

Suppose that and . With probability at least , applying stochastic block decoding to an SBC produces a vector satisfying

Proof:

Suppose that the event holds, that is, every block contains a non-straggler. Let the vectors be the columns corresponding to the non-zero entries of the decoding vector, as in (5). Note that each has expected value given by a distinct column of the matrix in (2). In particular, for a given column, some entries have expectation p and the remaining entries have expectation q. By the block structure in (2), this implies

Therefore,

(9)

Note that but by direct computation,

Let . Note that if , then , so we can apply Theorem A.4. This then implies that

(10)

Note that this was conditioned on the event that for all . Combining (9) and (10), we find that if holds, then with probability at least ,

Therefore, the probability that holds and that this bound on holds is at least . ∎

Note that as long as , . Plugging this into Theorem A.5, we derive Theorem IV.3. Setting p = q, we arrive at Corollary IV.4.
