A Sequential Approximation Framework for Coded Distributed Optimization


Jingge Zhu, Ye Pu, Vipul Gupta, Claire Tomlin, Kannan Ramchandran
EECS, University of California, Berkeley, CA, USA.
Email: {jingge.zhu, yepu, vipul_gupta, tomlin, kannanr}@eecs.berkeley.edu
The work of Jingge Zhu was supported by the Swiss National Science Foundation under Project P2ELP2_165137. The work of Ye Pu was supported by the Swiss National Science Foundation under Project P2ELP2_165155.
Abstract

Building on the previous work of Lee et al. [2] and Ferdinand and Draper [3] on coded computation, we propose a sequential approximation framework for solving optimization problems in a distributed manner. In a distributed computation system, the latency of individual processors (“stragglers”) usually causes a significant delay in the overall process. The proposed method is powered by a sequential computation scheme, which is designed specifically for systems with stragglers. This scheme has the desirable property that the user is guaranteed to receive useful (approximate) computation results whenever a processor finishes its subtask, even in the presence of uncertain latency. In this paper, we give a coding theorem for sequentially computing matrix-vector multiplications, and the optimality of this coding scheme is also established. As an application of the results, we demonstrate solving optimization problems using a sequential approximation approach, which accelerates the algorithm in a distributed system with stragglers.

I Introduction

Emerging applications from social networks and machine learning make distributed computation systems increasingly important for handling large-scale computation tasks. In this framework, a large computation task is divided into several smaller sub-tasks, each of which is dispatched to a different processor. The computation results are then aggregated and processed to produce the final result. A central challenge to this approach is how to handle uncertainty caused by “system noise” (see, e.g., [1]). One notable phenomenon is the “straggler” effect, namely, the latency of a single processor could cause a significant delay in the whole computational task. In existing distributed computation schemes, various straggler-detecting algorithms have been proposed to mitigate this problem. For example, Hadoop detects stragglers while executing the computation; when it detects a straggler, it runs a copy of the “straggled” task on a different processor. However, running many replicas of subtasks turns out to be inefficient.

A novel approach to mitigating uncertainty is to add controlled redundancy to the distributed computation tasks. Lee et al. [2] proposed a coded computation framework for computing matrix-vector multiplications in a distributed system. By using maximum distance separable (MDS) codes to encode the matrix and distributing smaller computation tasks to different processors, they show that coded computation provides significant gains over naïve replication methods in terms of computation time. Based on the same idea, Ferdinand and Draper [3] proposed a refined coding scheme (called the “anytime coding scheme”) where an approximation of the matrix-vector multiplication can be obtained in a timely fashion. We point out that the coded computation scheme has also been extended to study matrix multiplication problems [4], and is shown to be useful for reducing the communication overhead in distributed systems [5]. Furthermore, the idea of using codes in distributed systems has found various applications in machine learning problems, as shown in [6], [7], [8].

In this paper, we take a step further in studying how to tackle optimization problems using a coded computation approach. Building on the work of [2] and [3], we propose a sequential approximation method for solving optimization problems. The basic idea of this approach is that instead of directly solving the original problem, we solve a sequence of optimization problems (called approximations), whose solutions gracefully approach the solution of the original problem. These approximations need to be designed judiciously so that solving the approximate problems requires less computation time than solving the original problem. Consequently, as we show in the sequel, in the presence of stragglers, the sequential approximation method typically takes less time to find the solution of the original problem. The savings in execution time are more significant if we only aim to find an approximate solution to the original problem. An attractive feature of the proposed method is that the processors in the distributed system are oblivious to different approximations, making it a user-centered design.

The driving mechanism for the proposed sequential approximation framework is a so-called coded sequential computation scheme, designed specifically for distributed computation systems with latency. It has the desirable property that the user is guaranteed to receive useful (approximate) computation results whenever a subset of processors finish their subtasks, even in the presence of uncertain latency. In this paper, we focus our study on a coding scheme for sequentially computing matrix-vector multiplication, which is a basic building block for most algorithms. We then show how to integrate our coded sequential computation scheme into the sequential approximation framework in order to accelerate the algorithm in the distributed computation system.

II The Sequential Approximation Method: An Overview

We consider solving an optimization problem of the following form

(1)

where is a positive (semi)-definite matrix and is a closed, proper and convex function. The formulation in (1) represents a large class of problems of interest. For example, choosing to be the indicator function (which evaluates to 0 if its argument lies in the set and to +∞ otherwise) converts problem (1) into a constrained optimization problem with a quadratic objective function, where is a convex set. Choosing to be a norm of is also widely used in applications. For example, the choice converts (1) to the lasso problem.

Alternating methods are efficient optimization methods for solving problems of the form (1). The proximal gradient method [9] and ADMM are two examples of such methods. For instance, the proximal gradient method updates the variable as

(2)

where and denote the variable and the step size in the -th iteration, respectively. The proximal operator prox is defined as

The update rule of ADMM is given as

(3)

where denotes the Lagrangian multiplier. To solve the first update in (3) using first-order methods, we need to compute at each step.

From the above expressions, it can be argued that the matrix-vector multiplication is (one of) the most computationally expensive operations in this algorithm at each step (indeed, the proximal operator prox can be very simple for many problems of interest). With the focus on the matrix-vector multiplication, we will denote the update rule in (2) or (3) simply as in the sequel.
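To make this concrete, the following is a minimal Python/NumPy sketch of one update of the form (2); it assumes the smooth part of (1) is a quadratic whose gradient is A @ x - b (with A positive semidefinite), and all function and variable names are ours. The single matrix-vector product A @ x dominates the cost of the step.

```python
import numpy as np

def update_step(A, b, x, step, prox):
    # One proximal gradient step of the form (2), assuming the smooth part of (1)
    # has gradient A @ x - b.  `prox` is the proximal operator of the nonsmooth
    # term (a projection, soft-thresholding, etc.); the product A @ x is the
    # computationally dominant operation.
    return prox(x - step * (A @ x - b), step)

# Toy usage with the projection onto the nonnegative orthant as the prox
# (i.e. the nonsmooth term is the indicator function of the nonnegative orthant).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
x = np.zeros(2)
for _ in range(100):
    x = update_step(A, b, x, step=0.4, prox=lambda v, s: np.maximum(v, 0.0))
```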

In modern large-scale machine learning problems, the matrix can be very large, so that computing the matrix multiplication (or even storing the matrix) on one processor is not feasible. In order to handle such large-scale problems, we turn to a distributed computation paradigm where the task of computing is collaboratively accomplished by several processors. As discussed in the Introduction, the uncertainty (latency, for example) of individual processors could be detrimental to the distributed computation system and render the distributed computation approach unusable. To alleviate this problem, previous works (e.g. [2], [3], [6]) have proposed coded computation schemes that add redundancy to the computation tasks. We give a very simple example to illustrate the idea of coded computation. The matrix is vertically split into two smaller matrices . We use three processors to store separately, and each processor performs a smaller matrix-vector multiplication. It is easy to see that with any two of the three multiplications , the user is able to recover . The same idea can be applied to a general setting with more processors.
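The following minimal NumPy sketch makes this three-processor example concrete (the matrix size and the variable names are illustrative): the full product can be reassembled from the results of any two of the three processors.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)

A1, A2 = A[:3], A[3:]                  # vertical (row-wise) split of A
stored = {1: A1, 2: A2, 3: A1 + A2}    # what each of the three processors holds

# suppose only processors 1 and 3 have returned their results
y1, y3 = stored[1] @ x, stored[3] @ x
y2 = y3 - y1                           # (A1 + A2) x - A1 x = A2 x
assert np.allclose(np.concatenate([y1, y2]), A @ x)
```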

In this work, we take a step further and combine a coded sequential computation scheme with a modified algorithm. In particular, we propose the sequential approximation algorithm shown in Algorithm 1 for solving the problem in (1). In contrast to the original algorithm, Algorithm 1 executes a sequence of approximated problems (called “approximations”). Each approximation is of the same form as the original problem, but with a different choice of the matrix . Notice that in order to obtain the correct solution in the end, the last approximation matrix should be equal to . Algorithm 1 should possess the following two properties to be useful:

  • 1) Executing the approximations should be faster than executing the original iteration in a distributed system.

  • 2) The matrix approaches the original matrix as increases.

Property 1) ensures the proposed algorithm is faster than the original algorithm in the approximation phases, and Property 2) guarantees that Algorithm 1 eventually provides a solution which is close or identical to the solution of the original problem.

are matrices which approximate with increasing accuracy.
for  do    (first approximation, for  iterations)
     
end for
for  do    (second approximation, for  iterations)
     
end for
  ⋮
for  do
     
end for
Algorithm 1: A sequential approximation for solving (1). The function represents the update rule in (2) or (3).
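In code, the skeleton of Algorithm 1 might look as follows (a schematic sketch; the names of the update map, the approximation matrices and the per-level iteration counts are placeholders, and the last matrix in the list is the exact one).

```python
def sequential_approximation(x0, approx_matrices, iters_per_level, update_rule):
    # approx_matrices: increasingly accurate approximations, ending with the exact matrix
    # iters_per_level: number of iterations to spend at each approximation level
    # update_rule(A_i, x): one step of (2) or (3) using the matrix A_i
    x = x0
    for A_i, n_i in zip(approx_matrices, iters_per_level):
        for _ in range(n_i):
            x = update_rule(A_i, x)
    return x
```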

An illustration of the sequential approximation approach is given in Fig. 1. The black trajectory on the bottom represents the path of when the variable is updated using the exact computation , and the colored “detour” represents the trajectory using the sequential approximation method. (Notice that the actual trajectory of is not necessarily a straight line; the plot is only an illustration.)

Remark 1

It is natural to ask if the sequential approximation method in Algorithm 1 is already useful in a system without stragglers, where computing takes the same amount of time as computing . A preliminary investigation suggests that it will depend on both the algorithm and the optimization problem (e.g. the condition number of the matrix). For certain problems, the sequential approximation approach can indeed provide a better convergence rate even for systems without stragglers. The results will be reported in our future work.

[Figure 1: path5_v2.png]

Fig. 1: An illustration of the sequential approximation approach. The black path on the bottom represents the path of when the variable is updated using the exact computation , and the colored “detour” represents the trajectory using sequential approximation. During the -th approximation phase, the variable converges to the point , which is the optimal solution to the optimization problem (1) with replaced by . The sequential approximation approach could be faster in a distributed computation system if each iteration on the “detour” takes less computation time.

III Coding for Distributed Sequential Matrix-Vector Multiplication

A coded sequential computation scheme for the distributed system is the key mechanism behind the sequential approximation method. In this section, we formally introduce the coded sequential computation problem. It can be viewed as a general problem formulation of the anytime coding scheme studied in [3].

Consider a system with processors where each processor performs a matrix-vector multiplication of the form . The matrix is of dimension and is a -length vector. Let be matrices prescribed by the user where the matrix is of dimension . The goal is to compute the matrix-vector multiplications for a vector using these processors. To perform the computation in a distributed manner, matrices are generated based on the given matrices , namely

(4)

where denotes a mapping . The matrix is stored in the -th processor. To compute the multiplication, a vector is given to all processors, and each processor returns the result to the user when it finishes the computation. The user then applies a decoder to obtain desired multiplications using the received results . We would like to design our encoders and such that the system has the following property.

Property 1 (Sequential computation)

With the computation results from any processors (), for some , the user can recover where .

[Figure 2: system_verti.png]

Fig. 2: An illustration for distributed sequential matrix-vector multiplication. matrices are generated based on the prescribed matrices . The user first sends a vector to all processors (top). Whenever a processor returns the computation result , the user can recover an additional matrix-vector multiplication (bottom), .

Figure 2 gives an illustration of the sequential computation scheme of the distributed computation system. If a coding scheme satisfies the above sequential computation property, the corresponding value is called a feasible configuration. The question is: given a distributed computation system with parameters (as we shall see, the parameter can be chosen arbitrarily), what are the feasible configurations (possible values of ), and how do we design the encoders and the decoder for such a system?

This sequential computation scheme is useful for distributed computation systems with stragglers because it guarantees that any finished processor will provide useful results. Moreover, we could choose the matrices such that is more crucial than for our application if , such that “more important” results are received earlier. As pointed out in [3], this coding scheme can also be viewed as an approximation method for computing the multiplication , where the accuracy increases gradually as more and more processors finish their tasks.

III-A The coding scheme

In this section, we give a coding scheme for the sequential distributed matrix-vector multiplication problem. This scheme is a generalization of the MDS codes based scheme used in [2], and uses essentially the same idea as in [10] (multiple description coding) and [3].

Coding scheme: For each , we divide the matrix vertically into at most submatrices as follows

where denotes the -th row of the matrix . In the case when divides , we do not have the last matrix .

For each matrix where , we encode its rows to form a new matrix . In particular, we use a systematic MDS code, such that the first rows of are identical to , and the last rows are linear combinations of the rows of (i.e., parity checks). If does not divide , the rows of the last matrix are encoded with an MDS code into a new matrix .

The matrices are generated using the encoded matrices . More precisely, each matrix contains exactly one row of the matrix , for and for all . If we have the extra matrix for certain , its rows are distributed to arbitrary matrices .

Example: We apply the above coding scheme to a system with parameters , and the configuration . There are three matrices and to encode ( is zero in this case). The proposed coding scheme generates matrices as follows:

It can be checked that we can recover if we have for any , recover if we have for any , and recover with all the computation results.
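The construction and the sequential decoding can also be illustrated numerically. The sketch below uses a small instance of our own choosing (4 processors, two prescribed matrices with 3 and 4 rows, decodable from any 3 and from all 4 processors, respectively); a single-parity-check code, which is a systematic MDS code, stands in for the general MDS code of the scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = rng.standard_normal(d)
B1 = rng.standard_normal((3, d))   # to be recovered from any 3 of the 4 processors
B2 = rng.standard_normal((4, d))   # to be recovered once all 4 processors finish

# Systematic (4, 3) single-parity-check code for the rows of B1:
# the first three coded rows are B1 itself, the fourth is the sum of its rows.
G1 = np.vstack([np.eye(3), np.ones((1, 3))])
B1_coded = G1 @ B1                 # four coded rows, one stored on each processor
B2_coded = B2                      # the (4, 4) "code" is simply the identity

# Each processor stores one row of each coded matrix and returns the products with x.
replies = [(B1_coded[p] @ x, B2_coded[p] @ x) for p in range(4)]

# Any 3 processors (say {0, 2, 3}) determine B1 @ x by solving a small 3x3 system.
S = [0, 2, 3]
B1x = np.linalg.solve(G1[S], np.array([replies[p][0] for p in S]))
assert np.allclose(B1x, B1 @ x)

# Once all 4 processors have replied, B2 @ x is read off directly.
B2x = np.array([replies[p][1] for p in range(4)])
assert np.allclose(B2x, B2 @ x)
```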

Theorem 1 (Coding scheme)

Consider the distributed sequential matrix-vector multiplication problem with parameters . The configuration is feasible if it satisfies

(5)

where is defined as

(6)
Proof: The proof is given in the Appendix.

Remark 2

We point out that this result is a generalization of the coding scheme using a single MDS code proposed in [2], which can be seen as a special configuration with for some .

Remark 3 (Complexity of decoding)

With the computation results from processors, the decoding process at the user is equivalent to solving a linear system of at most unknowns, which does not depend on (number of columns of the matrices). Hence this coding scheme is most beneficial for computing matrix-vector multiplications when the number of rows of matrices is very large. Moreover, using MDS codes with special structures (Reed-Solomon codes for example), the decoding process is often much simpler than solving a generic linear system.

We can show that if we restrict ourselves to linear coding schemes (i.e., the encoder in (4) is a linear function of ), then the coding method in Section III-A is the best possible. This result establishes the optimality of our coding scheme in the previous section.

Theorem 2 (Converse for linear schemes)

Consider the distributed sequential matrix-vector multiplication problem with parameters . Under linear coding schemes, any feasible configuration must satisfy the constraint (5).

Proof: The proof is given in the Appendix.

IV The Sequential Approximation Method: Examples

Equipped with the coded sequential computation scheme described in Section III, the sequential approximation method in Algorithm 1 can be implemented where the matrix-vector multiplication is computed in a sequential manner. We demonstrate this method by considering the Lasso problem in its standard form

(7)

for a matrix and . This corresponds to the optimization problem in (1) by identifying and . The corresponding proximal gradient method for this problem is given by

(8)

where is the soft-thresholding operator defined as

For this problem, the soft-thresholding operator is very simple, and the most computationally expensive step in the algorithm is the matrix-vector multiplication for each step .
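For reference, here is a compact NumPy sketch of the update (8), assuming the standard lasso objective (1/2)||Ax - b||^2 + lambda*||x||_1 so that the gradient of the smooth part is A^T(Ax - b); the two matrix-vector products are the expensive operations that the next subsection distributes.

```python
import numpy as np

def soft_threshold(v, tau):
    # entrywise soft-thresholding: the proximal operator of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_step(A, b, x, step, lam):
    # one proximal gradient iteration (8); the products A @ x and A.T @ (...)
    # are the two expensive matrix-vector multiplications
    return soft_threshold(x - step * (A.T @ (A @ x - b)), step * lam)
```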

IV-A Approximations

Instead of computing at each step , we use the proposed sequential distributed computation scheme. Similar to [2], we first focus on computing the term . Using Algorithm 1 to solve the above problem requires a specification of . A priori, could be chosen in any way, as long as the multiplication requires less computation than . Similar to [3], in this paper we choose to be a low-rank approximation of using the singular value decomposition:

where denote the singular values of with rank . are the -th column of and , respectively. In particular, we choose as

(9)

for some . Namely captures the largest singular values of .
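A minimal NumPy sketch of this rank-truncation choice (the ranks used here are arbitrary; only the last element of the list is the exact matrix):

```python
import numpy as np

def rank_r_approx(A, r):
    # best rank-r approximation of A: keep only the r largest singular values
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
approx_matrices = [rank_r_approx(A, 10), rank_r_approx(A, 25), A]   # last level is exact
```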

Define for a chosen configuration which satisfies the condition (5). Using the coding scheme described in Section III-A, we generate matrices based on for the processors.

1: is the encoded matrix stored in processor
2:Receive an input vector
3:Compute
4:Send back to the user
Algorithm 2 Subroutine for computing : processor
1:Send a vector to all processors
2:Wait until processors finish, where
3:Decode using from processors
Algorithm 3 Subroutine for computing : user

When executing the algorithm, the vector is given to all processors at each time step . The coding scheme guarantees that with the computation results from any processors , the user can recover the multiplication result

(10)

where we define . An extra multiplication gives :

(11)

where in this case. We point out that in the lasso problem, the matrix in general has many more columns than rows, hence the above multiplication (11) takes less computation. Moreover, it can also be done using the distributed system in a similar way. By treating as the vector to be multiplied, the user can distribute the multiplication in the same way. Hence there are two computation steps in each iteration. We omit the details of the second step in this paper.

The subroutines for the processors and for the user are given in Algorithm 2 and Algorithm 3, respectively. We point out that the subroutine for the processors does not change across different approximation levels (different ), and only the user needs to adjust its procedure to adapt to the approximation level.
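The division of labor can be sketched as follows (a schematic sketch: worker delays are simulated by random exponential variables, the decoder is left abstract, and all names are illustrative). The worker code is the same at every level; only the argument k_i passed by the user changes.

```python
import numpy as np

def processor_subroutine(stored_block, x):
    # Algorithm 2: multiply the locally stored encoded block by the received vector
    return stored_block @ x

def user_subroutine(x, stored_blocks, k_i, decode, rng):
    # Algorithm 3: broadcast x, wait for the first k_i processors to finish
    # (simulated here by sorting random exponential delays), then decode.
    order = np.argsort(rng.exponential(size=len(stored_blocks)))
    replies = {p: processor_subroutine(stored_blocks[p], x) for p in order[:k_i]}
    return decode(replies)   # invert the small coded system, as in Section III-A
```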

Iv-B Computation time

As mentioned in Section II, the reason for adopting the sequential approximation method is that, in a distributed system with stragglers, it is faster to obtain a low-rank approximation than to obtain the exact answer. More precisely, let denote the random computation time of processor and let denote the -th order statistic, i.e.

The time for recovering the result is given by where satisfies

while computing the exact answer requires time where satisfies

Notice that we always have . In a distributed system with stragglers, could be significantly larger than if is larger than .
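For instance, if the processor times are i.i.d. exponential with rate μ (the model used in the numerical examples of Section IV-D), the expected waiting time for the fastest k out of P processors is (1/μ)(1/P + 1/(P−1) + ⋯ + 1/(P−k+1)), which increases sharply as k approaches P. A quick sanity check by simulation (the parameters are illustrative):

```python
import numpy as np

def expected_wait(k, P, mu=1.0):
    # E[k-th order statistic of P i.i.d. Exp(mu)] = (1/mu) * sum_{j=P-k+1}^{P} 1/j
    return sum(1.0 / j for j in range(P - k + 1, P + 1)) / mu

rng = np.random.default_rng(0)
P, mu = 4, 1.0
samples = np.sort(rng.exponential(1.0 / mu, size=(200_000, P)), axis=1)
for k in (3, 4):
    print(k, expected_wait(k, P, mu), samples[:, k - 1].mean())
```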

In summary, although Algorithm 1 starts with “incorrect” iterations using the approximation matrices , the computation time for those iterations is shorter than when using the exact matrix . If we choose the approximations judiciously (approaching gradually), the variable will approach the optimal solution, but with a shorter computation time.

Iv-C Choices of parameters

There are many free parameters to choose in the sequential approximation algorithm, including the approximation matrices , the number of approximation levels , and the number of iterations . They should be chosen such that the algorithm can be implemented with the given system parameters . In other words, if we choose as in (9), there should exist a feasible configuration which satisfies both condition (5) and the condition

(12)

for every with some satisfying if . Recall that these choices will affect the running time of each iteration as shown in Subsection IV-B, hence we could optimize the overall execution time with respect to these parameters.

More importantly, choosing as in (9) is only one way to approximate the matrix . There may exist other choices of which provide better convergence performance. The problem of choosing good approximations will be addressed in future work.

Iv-D Numerical results

In this section, we present numerical results to demonstrate the sequential approximation method with the Lasso problem in (7). Specifically, we consider the proximal gradient algorithm in (8).

[Figure 3: example1_simulation3.png]

Fig. 3: Numerical results for Example 1, showing the normalized suboptimality versus overall computation time. The solid line denotes the sequential approximation algorithm and the dashed line denotes the original proximal gradient method. Black circles denote the end of an approximation level. To reach a normalized suboptimality , the sequential approximation is roughly faster than the original algorithm.

As a toy example, we assume that the distributed system has processors where each processor can compute a matrix-vector multiplication with , namely . The matrix in the Lasso problem has dimension with rank and . In our simulations, the matrix is chosen randomly, and the regularization coefficient is chosen to be . The computation time of each processor is assumed to have an exponential distribution with the density function with the choice .

[Figure 4: example2_simulation3.png]

Fig. 4: Numerical results for Example 2, showing the normalized suboptimality versus overall computation time. The solid line denotes the sequential approximation algorithm and the dashed line denotes the original proximal gradient method. In this example, the algorithm never uses the exact matrix , hence it does not converge to the optimal solution of the original problem, but to an approximate solution with a suboptimality . In this case, the sequential approximation algorithm is almost two times faster than the original algorithm in reaching this suboptimality.

Example 1: We use two approximation levels where are chosen according to (9) with and (i.e., ). It can be checked straightforwardly that this particular choice can be implemented with a feasible configuration . With this choice, the matrix-vector multiplication is obtained when three processors finish their tasks. It can be seen that if the user needs the exact result , he must wait for all four processors to finish. It can be calculated that the average waiting time for three processors is approximately , while the average waiting time for four processors is . The performance of the algorithm is given in Figure 3.

Example 2 (approximate solution): This approach is also useful when we only aim to obtain an approximate solution of the problem. In this example we execute the sequential approximation algorithm with two levels of approximation where and . This is implemented with a configuration . In other words, the user can recover when any processor returns the result and can recover when any processors return the results. The average waiting time for one processor is approximately and the average waiting time for two processors is approximately . Notice that in this case, the sequential approximation algorithm cannot converge to the true minimizer since it does not use the exact matrix in any level. Nevertheless the simulation results show that it gives a fairly good approximate solution. The performance of the algorithm is given in Figure 4.

V Appendix

Proof of Theorem 1: It is easy to see that each matrix is encoded into at most matrices , whose total number of rows sums up to as defined in (6). Hence all the encoded data can be accommodated in the matrices if it satisfies .

Now we argue that this coding scheme possesses the sequential property in Property 1. Since the rows of are encoded with an MDS code into the matrix , it is easy to see that any entries of the matrix-vector multiplication allow us to recover . Recall that each row of is distributed to one processor, hence if processors return their matrix-multiplication results, the user can recover , and this holds for all .

If does not divide , we have the extra matrix whose rows are encoded using an MDS code into with rows, and each row is distributed to one processor. It can be seen that any processors will contain at least rows of . Indeed, there are processors which do not store any row of . The worst case is when a subset of processors includes all the processors which do not contain rows of . Even in this case, we still have the remaining processors which contain rows of . This shows that it is always possible to recover and (hence ) with any processors. This argument holds for all and thus concludes the proof.

Proof sketch of Theorem 2: We only consider linear coding schemes in this paper, where the matrix is a linear function of the matrices . It can be argued that any linear coding scheme can be reduced to a scheme where each row of consists of rows of only one matrix (as shown in the example in Section III-A). In other words, any linear scheme can be equivalently implemented as

(13)

where is of the form

for some where .

Since each row of is used for encoding only one matrix , we use to denote the number of rows used for encoding in processor (which contains ). Let be the number of rows used for the matrix across all processors, i.e., . Now we show that in order to have a valid coding scheme for the distributed sequential computation problem, should satisfy

which matches the achievable coding scheme in (6).

First notice that in order to recover using any processors, a necessary condition on is

(14)

for any subset with , simply because has length . Hence a lower bound on is given by the following optimization problem

minimize   (15)
subject to   (16)

If we ignore the integer constraint on , it is easy to see that the optimal solution to the relaxed linear programming problem is given by

and the lower bound is equal to . In the special case when divides , the optimal solution is an integer hence is also an optimal solution to the original problem (15). This shows that we have if divides .

If does not divide , it can be argued that, due to the complete symmetry of the problem (15), the optimal solution of the integer programming problem (15) satisfies

Moreover, at most among the processors are allowed to dedicate rows to , and all other processors must dedicate rows to . Indeed, if we have for all for some set with for some , then for a set we have

hence we would not be able to recover . We conclude that must satisfy

This shows a lower bound on for the case when does not divide .

References

  • [1] J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, p. 74, Feb. 2013.
  • [2] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding Up Distributed Machine Learning Using Codes,” arXiv:1512.02673 [cs, math], Dec. 2015, arXiv: 1512.02673. [Online]. Available: http://arxiv.org/abs/1512.02673
  • [3] N. S. Ferdinand and S. C. Draper, “Anytime coding for distributed computation,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sep. 2016, pp. 954–960.
  • [4] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication,” arXiv:1705.10464 [cs, math], May 2017, arXiv: 1705.10464. [Online]. Available: http://arxiv.org/abs/1705.10464
  • [5] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded MapReduce,” in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sep. 2015, pp. 964–971.
  • [6] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient Coding,” arXiv:1612.03301 [cs, math, stat], Dec. 2016, arXiv: 1612.03301. [Online]. Available: http://arxiv.org/abs/1612.03301
  • [7] S. Dutta, V. Cadambe, and P. Grover, “Short-Dot: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds.   Curran Associates, Inc., 2016, pp. 2100–2108.
  • [8] C. Karakus, Y. Sun, and S. Diggavi, “Encoded distributed optimization,” in International Symposium on Information Theory (ISIT), 2017.
  • [9] N. Parikh and S. Boyd, “Proximal Algorithms,” Found. Trends Optim., vol. 1, no. 3, pp. 127–239, Jan. 2014.
  • [10] R. Puri and K. Ramchandran, “Multiple description source coding using forward error correction codes,” in Conference Record of the Thirty-Third Asilomar Conference on Signals, Systems, and Computers, vol. 1, Oct. 1999, pp. 342–346 vol.1.