Straggler Mitigation by Delayed Relaunch of Tasks
Redundancy for straggler mitigation, originally in data download and more recently in the distributed computing context, has been shown to be effective both in theory and in practice. The analysis of systems with redundancy has drawn significant attention, and numerous papers have studied the pain and gain of redundancy under various service models and assumptions on straggler characteristics. We here present a cost (pain) vs. latency (gain) analysis of using simple replication or erasure coding for straggler mitigation in executing jobs with many tasks. We quantify the effect of the tail of task execution times and discuss tail heaviness as a decisive parameter for the cost and latency of using redundancy. Specifically, we find that coded redundancy achieves a better cost vs. latency tradeoff than simple replication and can reduce both cost and latency under less heavy tailed execution times. We show that delaying redundancy is not effective in reducing cost, whereas delayed relaunch of stragglers can yield a significant reduction in cost and latency. We validate these observations by comparing them with simulations that use empirical distributions extracted from Google cluster data.
IFIP WG 7.3 Performance 2017, Nov. 14-16, 2017, New York, NY, USA
1 Introduction and Model
Motivation: Distributed (computing) systems aim to attain scalability through parallel execution of the multiple tasks that constitute a job. Each task is run on a separate node, and the job is completed only when the slowest task is finished. It has been observed that task execution times have significant variability, e.g., because of resource sharing among multiple jobs and power management. The slowest tasks, which determine the job execution time, are known as "stragglers".
Two common performance metrics for distributed job execution are 1) latency, measuring the execution time, and 2) cost, measuring the resource usage. Job execution should be fast and cheap, but these are conflicting objectives. Replicating tasks and running the replicas on separate nodes has been shown to be effective in mitigating the effect of stragglers on latency, and is used in practice. Recent research proposes delaying replication in order to reduce cost, and cloning only the tasks that at some point appear to be straggling.
Erasure coding is a more general form of redundancy than simple replication, and it has been considered for straggler mitigation in both data download  and, more recently, in distributed computing context [8, 9].
We took this line of work further by analyzing the effect of coding and replication on the tradeoff between cost and latency as in . We examined whether the redundancy should be simple replication or coding, and when it should be introduced. We here extend the cost and latency analysis in  for systems that use redundancy together with task relaunch.
System Model: In our system, a job is split into multiple tasks. Job execution starts by launching all of its tasks, and redundancy is introduced only if the job has not completed by a given delay time. Note that we do not consider queueing of jobs or tasks; all tasks start service together.
In the replicated-redundancy system, if the job is still running at the time redundancy is introduced, replicas of each remaining task are launched. In the coded-redundancy system, if the job is still running at that point, redundant parity tasks are launched, and the completion of any $k$ of all $n$ launched tasks results in total job completion, where $k$ is the number of tasks the job was split into (see Fig. 1). Note that this assumption does not impose severe restrictions. Any linear computing algorithm can be implemented with this $k$-out-of-$n$ structure simply by using linear erasure codes. Particular examples can be found in, e.g., [8, 9] and references therein. If the system implements task relaunch, then the tasks that are still running when redundancy is introduced are canceled, and fresh copies are launched in their place immediately, together with the redundant tasks.
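To make the "any sufficiently large subset of tasks completes the job" structure concrete, here is a toy sketch of a linear computation (a matrix-vector product) protected by a single parity task. It is only an illustration under assumed names (`make_tasks`, `recover`) and a hand-picked 2-out-of-3 single-parity code; it is not a scheme taken from [8, 9], and it assumes the matrix has an even number of rows.

```python
def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def make_tasks(A, x):
    """Split A into two row blocks and add one parity block A1 + A2.

    Returns three independent tasks; each one is a (matrix, vector) pair
    that a separate node could evaluate with `matvec`.
    """
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return {"t1": (A1, x), "t2": (A2, x), "parity": (parity, x)}


def recover(results):
    """Rebuild the full product A @ x from the results of any two tasks."""
    if "t1" in results and "t2" in results:
        y1, y2 = results["t1"], results["t2"]
    elif "t1" in results:
        y1 = results["t1"]
        # (A1 + A2) x - A1 x = A2 x, by linearity.
        y2 = [p - a for p, a in zip(results["parity"], y1)]
    else:
        y2 = results["t2"]
        y1 = [p - b for p, b in zip(results["parity"], y2)]
    return y1 + y2


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
tasks = make_tasks(A, x)
# Pretend task "t2" is a straggler: only "t1" and "parity" have finished.
finished = {name: matvec(*tasks[name]) for name in ("t1", "parity")}
product = recover(finished)
```

Because the computation is linear, the parity result can stand in for either missing block, so the job no longer has to wait for the slowest of the three tasks.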
We assume that task execution times are iid and follow one of two canonical distributions: 1) shifted exponential (SExp), which models tasks that take some positive minimum time and have an exponential tail whose decay rate captures the randomness inherent in the system, and 2) Pareto, with a positive minimum value and a power law tail index. Pareto is a canonical heavy tailed distribution that is observed to fit task execution times in real computing systems [6, 12].
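The two task-time models can be sampled with inverse-transform sampling, as in the following minimal sketch. The function and parameter names (`shift`, `rate`, `minimum`, `alpha`) are illustrative choices, not the paper's notation.

```python
import math
import random


def sample_sexp(rng, shift, rate):
    """Shifted exponential: minimum value `shift`, exponential tail with rate `rate`."""
    return shift + rng.expovariate(rate)


def sample_pareto(rng, minimum, alpha):
    """Pareto with P(X > x) = (minimum / x) ** alpha for x >= minimum."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
    return minimum * u ** (-1.0 / alpha)


def tail_sexp(x, shift, rate):
    """P(X > x) for the shifted exponential."""
    return 1.0 if x <= shift else math.exp(-rate * (x - shift))


def tail_pareto(x, minimum, alpha):
    """P(X > x) for the Pareto distribution."""
    return 1.0 if x <= minimum else (minimum / x) ** alpha


rng = random.Random(1)
sexp_times = [sample_sexp(rng, 1.0, 2.0) for _ in range(100_000)]
pareto_times = [sample_pareto(rng, 1.0, 3.0) for _ in range(100_000)]
```

A smaller `alpha` gives a heavier Pareto tail; the mean is finite only for `alpha > 1`, in which case it equals `alpha * minimum / (alpha - 1)`.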
Fig. 2 plots the empirical tail distribution of task completion times for jobs with three different numbers of tasks in the Google Trace data. (Task lifetimes are calculated as the difference between the timestamps of the SCHEDULE and FINISH events in the trace.) Note that both axes are in log scale, so an exponential tail would appear as a curve that decays exponentially, while a true power law tail (e.g., Pareto) would show a linear decay at a constant rate. The empirical tail distributions in the figure exhibit exponential decay at small values and a trend similar to linear decay at larger values. Note that the steep decay at the far right edge is due to the bounded support of the distributions. Even though we cannot conclude that these empirical distributions are Pareto, they clearly exhibit more variability and have a heavier tail than an exponential distribution.
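The log-log reading of a tail plot can be checked numerically. The sketch below, with hypothetical helper names, builds an empirical tail from samples (assuming distinct sample values) and fits a least-squares slope on log-log axes: a power law tail gives a constant slope, while an exponential tail gets steeper as values grow.

```python
import math


def empirical_tail(samples):
    """Return (x, fraction of samples greater than x) pairs, sorted by x."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (n - i - 1) / n) for i, x in enumerate(xs)]


def loglog_slope(points):
    """Least-squares slope of log(tail) against log(x); zero tails are skipped."""
    pts = [(math.log(x), math.log(p)) for x, p in points if x > 0 and p > 0]
    mean_u = sum(u for u, _ in pts) / len(pts)
    mean_v = sum(v for _, v in pts) / len(pts)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in pts)
    sxx = sum((u - mean_u) ** 2 for u, _ in pts)
    return sxy / sxx


# Exact tails evaluated on a grid, to isolate the shape from sampling noise.
power_law = [(x, x ** -2.0) for x in range(1, 50)]
exp_small = [(x, math.exp(-x)) for x in range(1, 6)]
exp_large = [(x, math.exp(-x)) for x in range(6, 11)]

slope_power = loglog_slope(power_law)
slope_exp_small = loglog_slope(exp_small)
slope_exp_large = loglog_slope(exp_large)
```

For the power law points the fitted slope is the negative tail index, whereas the exponential points yield a more negative slope on the larger-value range than on the smaller one.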
We define the cost of job execution as the sum of the lifetimes of all tasks (including redundant ones) involved in the job execution. This definition reflects “pay for resources” pricing, which is the most commonly used model in cloud services such as those offered by Amazon, Google App Engine, and Windows Azure. There are two main setups that define cost: 1) cost with task cancellation, in which the remaining outstanding tasks are canceled upon job completion, a viable option for distributed computing with redundancy, and 2) cost without task cancellation, in which the tasks remaining after job completion run until they complete, which for instance is the only option for data transmission over a multi-path network with redundancy.
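The two cost definitions can be written down directly for a single job in which all tasks start together and the job completes once a required number of them have finished. This is a minimal sketch; the function name and the argument `required` are illustrative, not notation from the paper.

```python
def job_latency_and_cost(task_times, required):
    """Latency and both cost variants for one job with zero-delay redundancy.

    task_times: execution times of every launched task (original and redundant).
    required:   number of task completions that completes the job.
    """
    latency = sorted(task_times)[required - 1]
    # With cancellation, no task is charged beyond the job completion time.
    cost_with_cancellation = sum(min(t, latency) for t in task_times)
    # Without cancellation, every task is charged for its full lifetime.
    cost_without_cancellation = sum(task_times)
    return latency, cost_with_cancellation, cost_without_cancellation


latency, cost_c, cost_nc = job_latency_and_cost([1, 2, 3, 10], required=3)
```

In this example the straggler with execution time 10 is charged only 3 time units when cancellation is available, and the full 10 otherwise.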
In this paper, we analyze the effect of replicated and coded redundancy on cost and latency tradeoff. Specifically, we present exact expressions for expected latency and cost for redundancy with and without task relaunch. Using these expressions, we show the correlation of pain and gain of redundancy with the tail heaviness of the task execution times.
Summary of Observations: Coding allows us to increase degree of redundancy with finer steps than replication, which translates into greater achievable cost vs. latency region. Delaying redundancy is not effective in trading off latency for cost. Therefore, primarily the degree of redundancy should be tuned for the desired cost and latency values. Coding is shown to outperform replication in terms of cost and latency together. When the task execution times are heavy tailed, redundancy can reduce cost and latency simultaneously, where the reduction depends on the tail heaviness. For heavy tailed tasks, we show that relaunching tasks, even without any redundancy, is sufficient to return significant reduction in cost and latency.
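Notation: $T$ and $C$ denote the latency and cost of job execution. $H_n$ is the $n$th harmonic number, defined as $H_n = \sum_{i=1}^{n} \frac{1}{i}$ for $n \in \mathbb{Z}^+$ or equivalently as $H_n = \int_0^1 \frac{1 - x^n}{1 - x}\,dx$ for real $n$. $H_{n,2}$ denotes the generalized harmonic number of order two, defined as $H_{n,2} = \sum_{i=1}^{n} \frac{1}{i^2}$. The incomplete Beta function is defined for $q \in [0, 1]$ and $m, n \in \mathbb{R}^+$ as $B(q; m, n) = \int_0^q u^{m-1} (1 - u)^{n-1}\,du$, and the Beta function as $B(m, n) = B(1; m, n)$. The Gamma function is defined as $\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\,du$ for real $x > 0$ and as $\Gamma(x) = (x - 1)!$ for $x \in \mathbb{Z}^+$.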
2 Latency and Cost Analysis
In previous work, we concluded that delaying replicated or coded redundancy is not effective in reducing cost. This conclusion is based on the observation that delaying redundancy can bring a reduction in cost (gain) only after a significant increase in latency (pain), at which point one can achieve lower latency for the same cost by simply reducing the level of redundancy. This section gives a cost vs. latency analysis of zero-delay redundancy systems, where cost and latency are expressed in terms of the level of redundancy and the parameters of the task execution time distribution.
Thm. 1 gives exact expressions for the expected cost and latency under zero-delay redundancy, and Fig. 3 compares replication and coding for varying levels of redundancy. Under both shifted exponential and Pareto task execution times, coding always achieves better expected cost and latency than replication.
Let expected latency and cost with task cancellation be , for zero-delay replicated redundancy, and , for zero-delay coded redundancy. Under task execution time , we have
Under task execution time , we have
[Sketch] For replicated redundancy, lifetime of each task is . Then the expected values of latency and cost follow from first principles of order statistics. Same calculations apply for expectation of latency and cost of execution with coded redundancy.
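As an example of the order-statistics reasoning, the sketch below uses two textbook identities for $n$ iid shifted exponential tasks of which any $k$ complete the job: the expected $k$th smallest time is the shift plus $(H_n - H_{n-k})/\mu$, and the expected sum of lifetimes with cancellation is $n$ times the shift plus $k/\mu$. These are standard facts stated in an assumed parameterization (`shift`, `rate`); they are not claimed to be the exact form of the expressions in Thm. 1. A small Monte Carlo run cross-checks them.

```python
import random


def harmonic(n):
    """n-th harmonic number for a non-negative integer n."""
    return sum(1.0 / i for i in range(1, n + 1))


def expected_latency(k, n, shift, rate):
    """E[k-th smallest of n iid shifted exponential times]."""
    return shift + (harmonic(n) - harmonic(n - k)) / rate


def expected_cost_with_cancellation(k, n, shift, rate):
    """E[sum of task lifetimes] when all tasks stop at the k-th completion."""
    return n * shift + k / rate


def simulate(k, n, shift, rate, trials, seed=0):
    """Monte Carlo estimate of (mean latency, mean cost with cancellation)."""
    rng = random.Random(seed)
    total_latency = 0.0
    total_cost = 0.0
    for _ in range(trials):
        times = [shift + rng.expovariate(rate) for _ in range(n)]
        latency = sorted(times)[k - 1]
        total_latency += latency
        total_cost += sum(min(t, latency) for t in times)
    return total_latency / trials, total_cost / trials


sim_latency, sim_cost = simulate(k=10, n=15, shift=1.0, rate=1.0, trials=20_000)
```

In this parameterization, adding tasks (larger `n` for fixed `k`) lowers the expected latency but raises the expected cost through the `n * shift` term, which matches the behavior under an exponential tail.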
Under exponential tail, adding redundancy reduces latency but increases cost. In , replicated redundancy is demonstrated to reduce both cost and latency under heavy tailed task execution time. Fig. 3 plots latency and cost reduction for exponential and heavy tailed task execution times using the expressions given in Thm. 1. It illustrates the intuitive conclusion that redundancy can yield greater reduction in cost and latency under heavier tailed task execution times.
It is worth discussing how and why cost reduction matters. Cost, as defined here, reflects the amount of resource time used to execute a job. A reduction in cost then means running the same job while occupying less of the overall system capacity, which leaves room for more jobs and hence yields higher system throughput. To illustrate with a simple example, consider a First-Come First-Served queue with a single server under heavy traffic, that is, the server never goes idle since there is always a job waiting to be served. Then the average throughput is the reciprocal of the average job service time. In this case, the cost of serving a job is simply its service time, and system throughput increases as the average cost decreases. In systems with many servers that serve jobs with multiple tasks, the same principles apply, and cost reduction opens up space to execute more jobs per unit time.
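The single-server example can be spelled out in a few lines. This sketch assumes a server that is never idle, so the time needed to serve a batch of jobs is just the sum of their service times; the function name is illustrative.

```python
def saturated_throughput(service_times):
    """Jobs completed per unit time by a single server that is never idle."""
    total_time = sum(service_times)
    if total_time <= 0:
        raise ValueError("service times must sum to a positive value")
    return len(service_times) / total_time


baseline = saturated_throughput([2.0, 2.0, 2.0, 2.0])  # 0.5 jobs per time unit
cheaper = saturated_throughput([1.6, 1.6, 1.6, 1.6])   # lower cost per job
```

Here the per-job cost equals the service time, so reducing it from 2.0 to 1.6 raises the throughput from 0.5 to 0.625 jobs per time unit.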
Although exact expressions are formidable to derive, the second moments of latency and cost can be computed as described in Thm. 2. Second moments enable us to compute the standard deviation of latency and cost. Fig. 4 plots expected cost and latency with error bars whose width equals the standard deviation in the respective dimension. The standard deviation, and hence the variability, of cost and latency decreases with an increasing level of redundancy, as expected. For the same level of redundancy, the reduction in variability is greater with coding than with replication.
Then, under task execution time , second moments of latency and cost for zero-delay replicated and coded redundancy systems can be computed as
For , given and
Then, under task execution time , second moments of latency and cost for zero-delay replicated and coded redundancy systems can be computed as
Under heavy tail, it is possible to reduce latency by adding redundancy and still pay for the baseline cost of running with no redundancy. Corollary 1 gives expressions for the minimum achievable expected latency without exceeding the baseline cost of running with no redundancy.
Under task execution time in zero-delay replicated redundancy system, minimum latency that can be achieved without exceeding the baseline cost is,
where and any reduction in latency without exceeding the baseline cost is possible only if . For coded redundancy system,
or an upper bound on is
First consider replicated redundancy -system. Latency is a decreasing function of , while cost may decrease up to beyond which it increases with . We would like to find such that , then is simply . We next obtain the range of for which .
which holds only when , and since is a non-negative integer, , from which (1) follows.
Next consider a coded redundancy -system. Expressions for the expected cost in this case do not allow to find such that . Instead we can express as a function of as . We can then find an approximation for directly by relating to .
holds only if , from which (2) follows. Upper bound (4) follows from this by setting .
Fig. 5 illustrates that the maximum percentage reduction in latency (i.e., ; is latency without exceeding the baseline cost, is the latency with no redundancy) depends on the tail of task execution time. As stated in Corollary 1, this reduction is possible under replicated redundancy only when the tail index is less than , in other words when the tail is quite heavy, while coding relaxes this constraint. Moreover, the threshold on under replication is independent of the number of tasks , while threshold increases with under coding, meaning that jobs with larger number of tasks can get reduction in latency at no cost even for lighter tailed task execution times.
Demonstration using Google traces: We simulated expected cost and latency of redundancy systems by using the empirical task execution time distributions that we constructed using Google Trace data . Fig. 6 plots cost and latency curves using the three empirical distributions plotted in Fig. 2 for jobs with tasks.
For all three distributions, coding is doing better than replication in cost vs. latency tradeoff. Cost and latency can be reduced together with redundancy for all distributions because they pronounce a tail heavier than Exponential at large values. Note that although replication cannot reduce cost below the baseline cost of running with no redundancy, coding can achieve this for . Also for , although replication seems to achieve less cost and latency for low redundancy levels, coding outperforms replication beyond a certain level of redundancy.
3 Straggler Relaunch
There are two properties of heavy tailed distributions (e.g., Pareto) that greatly affect the latency of distributed job computation. First, if task execution times are heavy tailed, the longer a task has taken to execute, the longer its residual lifetime is expected to be. Second, the majority of the mass in a set of sample observations drawn from a heavy tailed distribution is contributed by only a few samples. This suggests that among many heavy tailed tasks, very few are expected to be stragglers with a very long completion time compared to the non-stragglers.
When task execution times are heavy tailed, after the non-straggler tasks finish, we expect to wait even longer for job completion, since the remaining straggler tasks are expected to take at least as much additional time as the non-stragglers have already taken. This suggests that killing straggler tasks and launching fresh replacements can reduce the job completion time.
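The first property can be quantified with the mean residual lifetime. For a Pareto task with minimum $\lambda$ and tail index $\alpha > 1$ that is still running at time $t \ge \lambda$, the expected remaining time is $t/(\alpha - 1)$, which grows with $t$, whereas for an exponential tail it stays constant. The sketch below states these standard facts in code, with illustrative names, and cross-checks the Pareto case by simulation.

```python
import random


def pareto_mean(minimum, alpha):
    """Expected execution time of a fresh Pareto task (requires alpha > 1)."""
    return alpha * minimum / (alpha - 1.0)


def pareto_mean_residual(t, minimum, alpha):
    """E[X - t | X > t] for a Pareto task, valid for t >= minimum and alpha > 1."""
    if alpha <= 1.0 or t < minimum:
        raise ValueError("requires alpha > 1 and t >= minimum")
    return t / (alpha - 1.0)


def exponential_mean_residual(rate):
    """E[X - t | X > t] for an exponential tail: memoryless, independent of t."""
    return 1.0 / rate


def simulated_pareto_residual(t, minimum, alpha, trials, seed=0):
    """Monte Carlo estimate of the Pareto mean residual lifetime at time t."""
    rng = random.Random(seed)
    total = 0.0
    count = 0
    for _ in range(trials):
        x = minimum * (1.0 - rng.random()) ** (-1.0 / alpha)
        if x > t:
            total += x - t
            count += 1
    return total / count


estimate = simulated_pareto_residual(t=2.0, minimum=1.0, alpha=4.0, trials=200_000)
```

Comparing `pareto_mean_residual(t, minimum, alpha)` with `pareto_mean(minimum, alpha)` shows that, for a single task in expectation, a fresh copy beats the running one once `t` exceeds `alpha * minimum`. This is only a per-task view; job-level latency depends on the maximum over all tasks.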
In this section, we study relaunching the remaining tasks after some delay and show that it can yield a significant reduction in cost and latency when the tail of the task execution time is heavy enough. We discuss the level of tail heaviness required for relaunching to be useful. In the system we study, the tasks to relaunch are determined by the time at which relaunching is performed. Untimely relaunch may reduce the gain or even cause pain, either through late relaunch, which delays the cancellation of stragglers, or through early relaunch, which kills non-stragglers. We present an approximately optimal relaunch time given the distribution of task execution times. Lastly, we discuss the cost and latency of adding redundancy together with task relaunch.
No redundancy with relaunch: Thm. 3 gives exact expressions for the expected latency and cost in no-redundancy systems, in which tasks that have not completed by the relaunch time are relaunched without introducing any redundancy. Note that we assume that relaunching takes place instantly upon cancellation, and thus it does not incur additional delay or cost. Fig. 6(a) compares the latency with and without relaunch in a no-redundancy system. Relaunching tasks before the minimum task completion time causes work loss and increases latency, while relaunching at the right time gives a significant reduction in job completion time. Since cost is a direct function of latency in the absence of redundancy, a reduction in latency will certainly reduce cost, as plotted in Fig. 6(b). Notice that relaunching all tasks at the very beginning and not relaunching at all implement identical behavior and give identical cost and latency.
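The relaunch policy without redundancy is easy to simulate. The following is a rough Monte Carlo sketch, not the exact analysis of Thm. 3: it assumes Pareto task times with illustrative parameters `minimum` and `alpha`, instantaneous and free relaunch as in the text, and a cost equal to the sum of all task lifetimes, including the time spent by canceled stragglers.

```python
import random


def pareto_sample(rng, minimum, alpha):
    """Draw one Pareto execution time by inverse-transform sampling."""
    return minimum * (1.0 - rng.random()) ** (-1.0 / alpha)


def job_with_relaunch(rng, num_tasks, minimum, alpha, delay):
    """Latency and cost of one job; delay=None disables relaunch."""
    latency = 0.0
    cost = 0.0
    for _ in range(num_tasks):
        t = pareto_sample(rng, minimum, alpha)
        if delay is not None and t > delay:
            # Cancel the straggler at `delay` and start a fresh copy at once.
            t = delay + pareto_sample(rng, minimum, alpha)
        latency = max(latency, t)
        cost += t
    return latency, cost


def mean_latency_and_cost(num_tasks, minimum, alpha, delay, trials, seed=0):
    """Average latency and cost over independent jobs."""
    rng = random.Random(seed)
    total_latency = 0.0
    total_cost = 0.0
    for _ in range(trials):
        latency, cost = job_with_relaunch(rng, num_tasks, minimum, alpha, delay)
        total_latency += latency
        total_cost += cost
    return total_latency / trials, total_cost / trials


no_relaunch = mean_latency_and_cost(100, 1.0, 3.0, None, trials=4000)
with_relaunch = mean_latency_and_cost(100, 1.0, 3.0, 2.0, trials=4000)
```

Sweeping `delay` over a grid with this function gives a quick empirical view of how latency and cost react to the relaunch time for a chosen tail index.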
Under task execution time in a no-redundancy system (i.e., or ) with relaunch, the tail probability of a job completion time is
The expected job completion time is
The expected cost with () and without () task cancellation is
where and , which is the expected job completion time without relaunch.
[Sketch] Defining random variable as the number of tasks completed before , where . Derivation of the tail, and expected latency and cost follows from the law of total probability or expectation by conditioning on .
An approximation for the optimal relaunch delay that achieves the lowest latency and cost is given in Corollary 2. This approximation converges to the true optimum exponentially fast as the number of tasks increases, and it is already very close to the true optimum over the range shown in Fig. 7. The optimal relaunch delay is an increasing function of the minimum task execution time and of the number of tasks, which intuitively makes sense. It is also a decreasing function of the tail index, meaning that it is better to relaunch earlier when the tail of task execution times is lighter, while for a heavy tail, delaying relaunch further helps to identify stragglers and performs better in terms of cost and latency. This is because relaunching means giving up work that is already completed in the hope that fresh copies execute much faster than the canceled stragglers. The expected gain from relaunching under a light tail is less than that under a heavier tail, since heavier tailed stragglers are expected to take much longer. Therefore, when the tail is light, it is better to take our chances with relaunch earlier and thereby reduce the amount of work lost to task cancellation.
Under task execution time in a no-redundancy system, a sufficient condition on , which guarantees that expected cost and latency can be reduced by relaunching tasks at some time is
Optimal relaunch time to achieve minimum cost and latency is approximated as
which implies that optimal fraction of tasks to relaunch on average can be approximated as
Upper bound on , and approximations for and get tighter as the number of tasks increases.
Approximation for follows from approximation for large and then minimizing expected latency given in eq. (6) by taking derivative with respect to .
Approximation for follows by observing that number of tasks completed before is where . Then, average fraction of tasks that are relaunched is in which we can use approximation for large .
Secondly, we show the sufficient condition on to be able to reduce with relaunch. Let be job completion time in no-redundancy system with no relaunch, then where .
This difference is smallest when for which we will use the approximate discussed above. We are interested in the maximum value of that would allow maximum possible difference to be negative as
where is by using the approximation for large . For , which is what we assume since it is a requirement for finite expected latency, holds, and so sufficient condition given in eq. (8) follows.
Expression for optimal delay tells us that it is better to relaunch earlier when task execution times have lighter tail. However, relaunching earlier does not mean that more tasks will be relaunched. Fraction of tasks that are relaunched after optimal delay monotonically decreases with 222 is a monotonically decreasing function of . For very heavy tail i.e., , for very light tail i.e., ., i.e., as the tail gets lighter. Notice that decreases with , which means for jobs with larger number of tasks, optimal strategy dictates relaunching smaller fraction of the tasks. For instance, suppose and , then , in other words, only of the tasks would need to be relaunched on average for optimal latency and cost.
Note that we assume relaunching takes place instantly and does not introduce any cost. Adding a relaunch cost to the analysis, which we leave as future work, could make the analysis more realistic and also give more insight into the search for an optimal relaunch strategy in practice.
Relaunching reduces the average number of stragglers by replacing tasks that appear to be straggling with fresh copies. For relaunching to be effective, the loss incurred by starting fresh copies from scratch should be compensated by avoiding the very long execution times of stragglers. In other words, for relaunching to be able to reduce latency and cost, task execution times must be heavy tailed beyond a threshold. If the tail is lighter than this threshold, relaunching actually hurts and increases cost and latency (e.g., relaunching always hurts when task execution times are light tailed). Corollary 2 gives a sufficient condition in the form of an upper bound on the tail index: for any tail index below this bound, relaunching helps to reduce cost and latency. Note that this upper bound does not depend on the minimum task completion time and is only proportional to the logarithm of the number of tasks, which we also validated by numerically computing the exact upper limit on the tail index. The upper bound given as the sufficient condition and the exact upper limit on the tail index get closer as the number of tasks increases, which is illustrated in Fig. 8.
Redundancy with relaunch: Here we study the effect of adding redundancy together with relaunching the remaining tasks at the delay time. We modify the previously studied redundancy systems as follows. In the replication system, each task remaining at the delay time is relaunched together with new replicas. In the coding system, each task remaining at the delay time is relaunched, and new parity tasks are added overall.
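One possible reading of these two modified systems is sketched below for a single job. The names (`num_replicas`, `num_parity`, `sample`) are hypothetical, and several modeling choices are assumptions rather than statements from the paper: stragglers are canceled at the delay time and charged up to that point, the coded job completes once the number of completions reaches the original task count, and replicas of a task are canceled as soon as one copy of that task finishes.

```python
def coded_relaunch_job(sample, k, num_parity, delay):
    """Latency and cost (with cancellation) under coding plus relaunch.

    sample: zero-argument callable returning one task execution time.
    """
    first = [sample() for _ in range(k)]
    done = [t for t in first if t <= delay]
    remaining = k - len(done)
    if remaining == 0:
        # The job finished before the delay: no relaunch, no redundancy.
        return max(first), sum(first)
    # Fresh copies of the stragglers and the parity tasks all start at `delay`.
    fresh = sorted(sample() for _ in range(remaining + num_parity))
    extra = fresh[remaining - 1]  # time until `remaining` more completions
    cost = sum(done) + remaining * delay + sum(min(t, extra) for t in fresh)
    return delay + extra, cost


def replicated_relaunch_job(sample, k, num_replicas, delay):
    """Latency and cost (with cancellation) under replication plus relaunch."""
    first = [sample() for _ in range(k)]
    done = [t for t in first if t <= delay]
    remaining = k - len(done)
    if remaining == 0:
        return max(first), sum(first)
    copies = 1 + num_replicas  # relaunched copy plus its new replicas
    finish = [min(sample() for _ in range(copies)) for _ in range(remaining)]
    cost = sum(done) + remaining * delay + sum(copies * f for f in finish)
    return delay + max(finish), cost


# Deterministic walk-through with hand-picked execution times.
times = iter([1, 5, 6, 4, 1, 3])
coded = coded_relaunch_job(lambda: next(times), k=3, num_parity=1, delay=2)
```

In the walk-through, one task finishes at time 1, two stragglers are canceled at time 2, and the job completes at time 5 once two of the three freshly launched tasks are done. Passing a random sampler instead of the iterator turns the same functions into a Monte Carlo simulator.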
Thm. 4 states exact expressions for the expected cost and latency in replicated and coded redundancy systems with relaunch, respectively. Fig. 9 plots cost vs. latency curves for varying levels of redundancy. When the level of coded redundancy is low, there is an optimum delay that gives the minimum cost and latency, as observed previously for the no-redundancy system with relaunch. As the level of coded redundancy increases, redundancy has a greater effect on cost and latency than relaunching stragglers, and delaying redundancy is no longer effective in reducing cost, as observed previously for redundancy systems with no relaunch. In the replication system, delaying redundancy is ineffective in reducing cost at all times. This is because replicating each remaining task even once is enough to dominate relaunching stragglers in terms of the effect on cost and latency.
Suppose task execution time is . In replication -system with relaunch, expected job execution time is
where is the expected job completion time for no-redundancy system with relaunch as given in eq. (6).
Expected cost with () and without () task cancellation is
where and . In coding -system with relaunch, expected job execution time is
Expected cost with () and without () task cancellation is
4 Open Problems
In the analysis given here, we assumed that task execution times are iid and do not depend on the resources the tasks run on. The main justification for this assumption is that computing systems usually have a large number of nodes and each task can be placed on a separate node at random. In other words, we assumed that the task execution time is decoupled from resource scheduling. However, the execution time of a task that runs exclusively on a node would have a different distribution than that of a task that runs together with several others on the same node. There are two challenges in the analysis of latency and cost with resource scheduling in mind: 1) there is not enough experimental evidence to accurately model execution time with respect to resource load, and 2) order statistics of multiple distribution families are intractable.
Moreover, in reality, task execution times cannot be arbitrarily large, but modeling them as SExp or Pareto does not enforce this restriction. Another possible extension of the cost and latency analysis would be to consider right truncated execution time models. The main challenge for this extension is that truncation renders the order statistics analysis tedious and often intractable.
- [1] Al-Roomi, M., Al-Ebrahim, S., Buqrais, S., and Ahmad, I. Cloud computing pricing models: a survey. International Journal of Grid and Distributed Computing 6, 5 (2013), 93–106.
- [2] Ananthanarayanan, G., Ghodsi, A., Shenker, S., and Stoica, I. Effective straggler mitigation: Attack of the clones. In NSDI (2013), vol. 13, pp. 185–198.
- [3] Arnold, B. C. Pareto distribution. Wiley Online Library, 2015.
- [4] Arnold, B. C., Balakrishnan, N., and Nagaraja, H. N. A first course in order statistics. SIAM, 2008.
- [5] Crovella, M. E. Performance evaluation with heavy tailed distributions. In Workshop on Job Scheduling Strategies for Parallel Processing (2001), Springer, pp. 1–10.
- [6] Dean, J., and Barroso, L. A. The tail at scale. Communications of the ACM 56, 2 (2013), 74–80.
- [7] Dean, J., and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (2008), 107–113.
- [8] Dutta, S., Cadambe, V., and Grover, P. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems (2016), pp. 2092–2100.
- [9] Halbawi, W., Azizan-Ruhi, N., Salehi, F., and Hassibi, B. Improving distributed gradient descent using Reed-Solomon codes. arXiv preprint arXiv:1706.05436 (2017).
- [10] Joshi, G., Soljanin, E., and Wornell, G. Queues with redundancy: Latency-cost analysis. ACM SIGMETRICS Performance Evaluation Review 43, 2 (2015), 54–56.
- [11] Aktaş, M., Peng, P., and Soljanin, E. Effective straggler mitigation: Which clones should attack and when? In MAMA Workshop SIGMETRICS (2017).
- [12] Reiss, C., Tumanov, A., Ganger, G. R., Katz, R. H., and Kozuch, M. A. Towards understanding heterogeneous clouds at scale: Google trace analysis. Intel Science and Technology Center for Cloud Computing, Tech. Rep (2012), 84.
- [13] Reiss, C., Wilkes, J., and Hellerstein, J. L. Google cluster-usage traces: format + schema. Google Inc., White Paper (2011), 1–14.
- [14] Wang, D., Joshi, G., and Wornell, G. Using straggler replication to reduce latency in large-scale parallel computing. ACM SIGMETRICS Performance Evaluation Review 43, 3 (2015), 7–11.