A new analysis of Work Stealing with latency
Abstract.
We study in this paper the impact of communication latency on the classical Work Stealing load balancing algorithm. We extend the reference model by introducing a latency parameter. Using theoretical analysis and simulation, we study the overall impact of this latency on the Makespan (maximum completion time). We derive a new expression of the expected running time of a bag of tasks scheduled by Work Stealing. This expression enables us to predict under which conditions a given run will yield acceptable performance. For instance, we can easily calibrate the maximal number of processors to use for a given work/platform combination. All our results are validated through simulation on a wide range of parameters.
1. Introduction
The motivation of this work is to extend the analysis of the Work Stealing (WS) algorithm to a distributed-memory context, where communications matter. WS is a classical online scheduling algorithm proposed for shared-memory multicores (Arora et al., 2001) whose principle is recalled in the next section. As is common, we target the minimization of the Makespan, defined as the maximum completion time of the parallel application. We present a theoretical analysis that yields an upper bound on the expected Makespan, and we run a complementary series of simulations to assess how this new bound behaves in practice depending on the value of the latency.
1.1. Motivation for studying WS with latency
Distributed-memory clusters consist of independent processing elements with private local memories linked by an interconnection network. In such architectures, communication issues are crucial: they strongly influence the performance of applications (Hamid et al., 2015b). However, only a few works deal with optimized allocation strategies, and the relationship between allocation and the scheduling process is most often ignored. In practice, the impact of scheduling may be huge since the whole execution can be strongly affected by a large communication latency of the interconnection network (Hamid et al., 2015a). Scheduling is the process that determines where and when to execute the tasks of a target parallel application. The applications are represented as directed acyclic graphs where the vertices are the basic operations and the arcs are the dependencies between the tasks (Cosnard and Trystram, 1995). Scheduling is a crucial problem which has been extensively studied under many variants for the successive generations of parallel and distributed systems. The most commonly studied objective is to minimize the makespan (denoted by ) and the underlying context is usually to consider centralized algorithms. This assumption is not always realistic, especially if we consider distributed memory allocations and an online setting.
WS is an efficient scheduling mechanism targeting medium range parallelism of multicores for fine-grain tasks. Its principle is briefly recalled as follows: each processor manages its own (local) list of tasks. When a processor becomes idle it randomly chooses another processor and steals some work (if possible). Its analysis is probabilistic since the algorithm itself is randomized. Today, the research on WS is driven by the question of how to extend the analysis to the characteristics of new computing platforms (distributed memory, large scale, heterogeneity). Notice that besides its theoretical interest, WS has been implemented successfully in several languages and parallel libraries including Cilk (Frigo et al., 1998; Leiserson, 2009), TBB (Threading Building Blocks) (Robison et al., 2008), PGAS languages (Dinan et al., 2009; jai Min et al., 2011) and the KAAPI runtime system (Gautier et al., 2007).
1.2. Related works
We start by reviewing the most relevant theoretically oriented works. WS was originally studied by Blumofe and Leiserson (Blumofe and Leiserson, 1999). They showed that the expected Makespan of a series-parallel precedence graph with unit tasks on processors is bounded by where is the length of the critical path of the graph (its depth). This analysis was improved by Arora et al. (Arora et al., 2001) using potential functions. The case of varying processor speeds was studied by Bender and Rabin (Bender and Rabin, 2002), who introduced a new policy called high utilization scheduler that extends the homogeneous case. The specific case of tree-shaped computations with a more accurate model was studied in (Sanders, 1999). However, in all these previous analyses, the precedence graph is constrained to have only one source and an out-degree of at most 2, which does not easily model the basic case of independent tasks. Simulating independent tasks with a binary precedence tree gives a bound of since a complete binary tree of vertices has a depth . However, with this approach, the structure of the binary tree dictates which tasks are stolen. As a complement, (Gast and Bruno, 2010) provided a theoretical analysis based on a Markovian model using mean field theory. They targeted the expectation of the average response time and showed that the system converges to a deterministic Ordinary Differential Equation. Note that there exist other results that study the steady state performance of WS when the work generation is random, including Berenbrink et al. (Berenbrink et al., 2003), Mitzenmacher (Mitzenmacher, 1998), Lüling and Monien (Lüling and Monien, 1993) and Rudolph et al. (Rudolph et al., 1991). More recently, in (Tchiboukdjian et al., 2013), Tchiboukdjian et al. provided the best bound known at this time: where .
In all these previous theoretical results, communications are not directly addressed (or at least are taken implicitly into account by the underlying model). Work on WS has largely focused on shared-memory systems, and its performance on modern platforms (distributed-memory systems, hierarchical platforms, clusters with explicit communication cost) is not well understood. The difficulty lies in the communication problems, which become more crucial on modern platforms (Arafat et al., 2016).
Dinan et al. (Dinan et al., 2009) implemented WS on large-scale clusters, and proposed to split each task queue into a part accessed asynchronously by local processes and a shared portion synchronized by a lock, which can be accessed by any remote process, in order to reduce contention. Multicore processors based on NUMA (non-uniform memory access) architectures are mainstream today, and such new platforms include accelerators such as GPGPUs (Yang and He, 2017). The authors propose an efficient task management mechanism that divides the application into a large number of fine-grained tasks, which generate a large number of small communications between the CPU and the GPGPUs. However, these data transmissions are very slow. They consider that the transmission time of small data and big data is the same.
Besides the large literature on theoretical works, there exist more practical studies implementing WS libraries where some attempts were made to take communications into account. SLAW is a task-based library introduced in (Guo et al., 2010), combining work-first and help-first scheduling policies and focused on locality awareness in PGAS (Partitioned Global Address Space) languages like UPC (Unified Parallel C). It has been extended in HotSLAW, which provides a high-level API that abstracts concurrent task management (jai Min et al., 2011). (Li et al., 2013) proposes an asynchronous WS (AsynchWS) strategy which exploits opportunities to overlap communication with local tasks, making it possible to hide high communication overheads in distributed-memory systems. The principle is based on a hierarchical victim selection, also based on PGAS. Perarnau and Sato presented in (Perarnau and Sato, 2014) an experimental evaluation of WS at the scale of ten thousand compute nodes, where the communication depends on the distance between the nodes. They investigated in detail the impact of the communication on the performance. In particular, the physical distance between remote nodes is taken into account. Muller and Acar studied in (Muller and Acar, 2016) Latency-Hiding, a new WS algorithm that hides the overhead caused by some operations, such as waiting for a request from a client or waiting for a response from a remote machine. The authors refer to this delay as latency, which is slightly different from the more general concept we consider in our paper. Agrawal et al. proposed an analysis (Agrawal et al., 2010) showing the optimality for task graphs with bounded degrees and developed a library in Cilk++ called Nabbit for executing tasks with arbitrary dependencies, with reasonable block sizes.
1.3. Contributions
In this work, we study how communication latency impacts work stealing. Our work has three main contributions. First, we create a new realistic scheduling model for distributed-memory clusters of identical processors including latency denoted by . Second, we provide an upper bound of the expected makespan. This bound is composed of the usual lower bound on the best possible load balancing plus an additional term proportional to where and are the total amount of work and the total number of processors respectively. Third, we provide simulation results to assess this bound. These experiments show that the theoretical bound is roughly 5 times greater than the one observed in the experiments but that the additional term has indeed the form . The theoretical analysis shows that while the simulation results suggest that .
The analysis is based on an adequate potential function. Two points distinguish this analysis from the existing ones. First, we must find the right potential function: the natural extension does not work since we now need to consider work in transit, and the function should diminish after any steal-related operation. Second, we consider large time steps of duration equal to the communication latency.
2. Work Stealing
In this section, we introduce some formal notations and the WS algorithm. Finally we present the variant of the WS algorithm that we analyse.
2.1. Notation and definition
We consider a discrete time model. There are processors. We denote by the amount of work on processor at time (for ). A unit of work corresponds to one unit of execution time. We denote the total amount of work on all processors by . At all work is on . The total amount of work at time is denoted by .
2.2. WS algorithm
Work Stealing is a decentralized list scheduling algorithm where each processor maintains its own local queue of tasks to execute. uses to get and execute tasks while is not empty. When becomes empty chooses another processor randomly and sends a steal request to it. When receives this request, it answers either by sending some of its work or by a fail (negative) response. We will see below the cases in which the response is a fail.
We analyse one of the variants of the WS algorithm that has the following features:

Latency: All communication takes a time that we call the latency. A work request that is sent at time by a thief will be received at time by the victim. The thief will then receive an answer at time . As we consider a discrete-time model, we say that a work request arrives at time if it arrives between (not included) and . This means that at time , this work request is treated. The number of incoming work requests at time is denoted by . Note that . It is equal to the number of processors sending a work request at time .
When a processor receives a work request from a thief , it sends a part of its work to . This communication again takes units of time. receives the work at time . We denote by the amount of work in transit from at time . At the end of the communication becomes 0 until a new work request arrives.

Steal Threshold: The main goal of WS is to share work between processors in order to balance load and speed up execution. In some cases, however, it might be beneficial to keep work local and answer negatively to some steal requests. We assume that if the victim has less than units of work to execute, the steal request fails.

Single work transfer: We assume that a processor can send some work to at most one processor at a time. While the processor sends work to a thief, it replies with a fail response to any other steal request. Using this variant, the steal request may fail in the following cases: when the victim does not have enough work or when it is already sending some work to another thief. Another case arises when the victim receives more than one steal request at the same time. It then serves one thief chosen at random and sends a negative response to the remaining thieves.

Work division: in order to keep a balanced WS, we consider that the victim sends to the thief a part of work such that both loads will be balanced at the end of the communication. More precisely, if receives a work request from at time then:
(1) After a time :
(2)
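The work division rule can be illustrated with a minimal sketch. The names `split_work`, `victim_work` and `lam` are hypothetical, and the formula is an assumption derived from the stated goal (both loads equal once the transfer completes, with the victim executing one unit of work per time unit during the `lam` time units of the transfer). The failure test `victim_work <= lam` is likewise an assumed stand-in for the steal threshold.

```python
def split_work(victim_work, lam):
    """Split a victim's work so that both loads match when the transfer ends.

    Assumption: the victim keeps executing at unit speed during the `lam`
    time units of the transfer, so it must keep `lam` more units than it sends.
    Returns (sent, kept); sent is 0 when the steal request fails.
    """
    if victim_work <= lam:
        # Assumed threshold: not enough work to make a transfer worthwhile.
        return 0.0, float(victim_work)
    sent = (victim_work - lam) / 2.0
    kept = victim_work - sent
    return sent, kept


sent, kept = split_work(100, 10)
# After lam = 10 time units the victim has executed 10 units of what it kept,
# so the victim and the thief hold the same amount of work.
victim_after_transfer = kept - 10
print(sent, kept, victim_after_transfer)
```

With 100 units of work and a latency of 10, the victim sends 45 units and keeps 55; when the work reaches the thief, both hold 45 units.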
3. Theoretical Analysis
3.1. Principle
Before presenting the detailed analysis, we first describe its main steps.
We denote by the makespan (i.e., total execution time). In a WS algorithm each processor either executes work or tries to steal work. As the round-trip time of a communication is and the total amount of work is equal to and the number of processors is , we have where is the number of processors. We therefore have a straightforward bound on the Makespan:
(3) 
Note that the above inequality is not an equality because the execution might end while some processors are still waiting for work.
Our analysis makes use of a potential function that represents how well the jobs are balanced in the system. We bound the number of steal requests by showing that each event involving a steal operation contributes to the decrease of this potential function. This analysis shares some similarities with the one of (Tchiboukdjian et al., 2013) as we make use of a potential function that decreases with steal requests. The key difficulty in applying these ideas in our case is that communications take time units. At first, it seems that longer durations should translate linearly into the time taken by steal requests, but this would neglect the fact that longer steal durations reduce the number of steal requests.
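The accounting argument behind Equation (3) can be written out as follows. The symbols are assumptions made for illustration, since the original notation is not reproduced here: $p$ processors, total work $W$, latency $\lambda$, makespan $C_{\max}$ and $R$ the total number of steal requests. At every time unit each processor either executes one unit of work or waits on a steal request whose round trip lasts $2\lambda$, so summing over all processors gives:

```latex
p \, C_{\max} \;\le\; W + 2\lambda R
\quad\Longrightarrow\quad
C_{\max} \;\le\; \frac{W}{p} + \frac{2\lambda R}{p}
```

Under this reading, bounding the makespan reduces to bounding the number of steal requests $R$.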
In order to analyse the impact of , we reconsider the time division as periods of duration . We analyse the system at each time step for . To simplify the notations between the time and the interval division , we denote respectively by and the quantities and . We denote by the part of the potential linked to processor and by the potential. We also define the total number of incoming steal work requests in the interval by and we denote by the probability that a processor receives one or more requests in the interval (this function will be computed in the next section).
In the next section, we analyze the decrease of as a function of the number of steal requests. We show that there exists a function depending on the number of incoming steal requests in the time interval , such that on average, is less than . Finally, we use this to derive a bound on the total number of steal requests. By using equation (3), we obtain a bound on the Makespan.
4. Detailed Analysis
In this section, we prove the main result of the paper, which is a bound on the total completion time and is summarized by the following theorem:
Theorem 4.1 ().
Let be the Makespan of unit independent tasks scheduled by WS with latency algorithm. Then,
In particular:
The proof of this result is based on the analysis of the decrease of a function – that we call the potential. The potential at timestep is denoted and is defined as
This potential function always decreases. It is maximal when all work is contained in one processor which is the potential function at time and is equal to . The schedule completes when the potential becomes .
We divide our proof of Theorem 4.1 in two lemmas. First, in Lemma 4.2 we show that the expected decrease of the potential can be bounded as a function of the number of work requests. Second, in Lemma 4.3, we show how such a bound leads to a bound on the expected number of work requests.
We denote by all events up to the interval .
Lemma 4.2 ().
For all steps , the expected ratio between and knowing is bounded by:
(4) 
Where is the probability for a processor to receive one or more requests in the interval knowing that there are incoming steal requests.
Proof.
To analyze the decrease of the potential function, we distinguish different cases depending on whether the processor is executing work, sending or answering steal requests. We show that each case contributes to a variation of potential.
Between time and , a processor does one of the following things:

The processor is executing and sending work.

The processor is executing work and available to respond to the steal requests sent by idle processors.

The processor is executing work and will be idle soon.

The processor is idle.
We analyse below how much each case contributes to the decrease of the potential.

: is executing and started to send work to another idle processor before time . This means that will receive work before time . As we do not know when the communication finishes, we study the worst case in this scenario. In particular, we proceed as if (a) and respond negatively to any steal request before the end of the communication; and (b) they do not execute work. Such events would decrease the potential function.
By the "work division" principle (Equation (1)), once the steal request is completed we have . This shows that the quantities of work and of work in transit at time satisfy
Thus,
Moreover, by Equation (2) we have . This shows that . Hence, the ratio of potential for each couple ( as victim, as thief) is less than :

: is executing work and available to respond to a steal request. We distinguish two cases: (case 2a) if receives a request or (case 2b) if it does not receive a request.
Case 2a – We compute the ratio of potential between and when receives one or more steal requests. will respond to the first steal request. All other steal requests will fail. The worst case is to receive the steal request at the end of the interval. This shows that
which implies that
This generates a ratio of potential between and for each couple ( as victim, as thief) smaller than :
Case 2b – If does not receive any work requests, the work decreases, in which case .
Let be the probability that the processor receives a work request between and . To compute , we observe that receives zero work requests if the thieves choose another processor. Each of these events is independent and happens with probability . Hence, the probability that receives one or more work requests is :
(5) This shows that the ratio of potential in this scenario is:

: has only a small amount of work and . In this case, will respond negatively to any work request; the potential function goes to and generates a ratio equal to

is idle, so it can be thief in which waits for work or thief in which sends a steal request. In both cases we already have taken its contribution to the potential into account.
Using the variation of each of these scenarios we find that the expected potential at time is bounded by:
Thus,
where the last inequality holds because and therefore . ∎
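The probability used in Case 2b of the proof above can be checked numerically. The sketch below assumes, as in the model, that each of `r` thieves independently picks its victim uniformly among the `p - 1` other processors, so a given processor is missed by one thief with probability `1 - 1/(p - 1)`. The function names are hypothetical, and the closed form is a reconstruction from that argument rather than a quotation of Equation (5).

```python
import random


def prob_at_least_one_request(r, p):
    """Closed form: 1 - (1 - 1/(p-1))**r for r independent uniform thieves."""
    return 1.0 - (1.0 - 1.0 / (p - 1)) ** r


def estimate_by_sampling(r, p, trials=20000, seed=1):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each thief draws one of its p-1 possible victims; index 0 stands
        # for the processor we are watching.
        if any(rng.randrange(p - 1) == 0 for _ in range(r)):
            hits += 1
    return hits / trials


print(prob_at_least_one_request(4, 9), estimate_by_sampling(4, 9))
```

The closed form grows from roughly `r / (p - 1)` when few processors are stealing towards a constant below 1 when almost all of them are.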
Lemma 4.3 ().
Assume that there exists a function : such that the expected potential at time given satisfies:
(6) 
Let denote the potential at time and let be defined as:
Let be the first time step at which is less than for all processors:
The number of incoming steal requests until is, , satisfies:
Proof.
By definition of , for a quantity , we have and therefore . By using Equation (6), this shows that
Let . By using the equation above, we have:
This shows that is a submartingale for the filtration . As is a stopping time for the filtration , Doob’s optional stopping theorem (see e.g., (Durrett, 1996, Theorem 4.1)) implies that
(7) 
By definition of , we have and . Hence, this implies that
(8) 
Recall that is the first interval in which each processor has an amount of work less than , {: : }. This means that at , there exists at least one processor with . If this processor received a steal request between and , we have . If this processor did not receive a steal request between and , we have . This implies that
Plugging this into Equation (8) shows that
(9) 
By Jensen’s inequality (see e.g., (Durrett, 1996, Equation (3.2))), we have . This shows that
As , we have . Hence:
(10) 
By Markov’s inequality, Equation (9) implies that for all :
By using , this implies that
∎
We are now ready to conclude the proof of Theorem 4.1, by applying the previous lemma with .
Proof of Theorem 4.1.
The number of incoming steal requests until is bounded by Lemma 4.3. Using the definition of , there remain at most 3 steps of to finish the execution. By using Equation (3) we have,
We now show that the above constant satisfies . For that, we first show that the quantity is increasing in . Then, we use the maximum of and we bound the value by .
Let and . By definition of , can be written as:
Denoting and , the derivative of with respect to is:
As , the derivative of with regard to is . This shows that , thus is decreasing. Using the fact that for all :
Then,
∎
5. Experiments Analysis
In the previous section, we proved a new upper bound on the Makespan of WS with an explicit latency. The objective of this section is to study WS experimentally in order to confirm the theoretical results and to refine the constant . To this end, we developed an ad hoc simulator that strictly follows our model.
We start by describing our WS simulator and the considered test configurations. Using the experimental results, we show that our theoretical bound is close to the experimental results. We conclude with a discussion on these observations.
5.1. Simulator and configurations
We have developed a Python discrete event simulator for running adequate experiments. This simulator follows the model described in Section 2 to schedule an amount of work on a distributed platform composed of identical processors. Between any two processors, the communication cost is equal to the latency . These three parameters are configurable, and the simulator is generic enough to be used in different contexts of online scheduling and interfaces with standard trace analysis tools. To ensure reproducibility, the code is available on GitHub (https://github.com/wagnerf42/wssimulator).
Let us describe our experimental parameters. We consider constant speed processors, which means that the work can be described on a time-unit basis, and the same holds for the latency. Ultimately, only the ratio between and matters. Similar results would be observed by multiplying and by the same constant. For our tests we take different parameters with between and , between 32 and 256 and between 2 and 500. Each experiment has been reproduced 1000 times.
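To make the model of Section 2 concrete, here is a minimal discrete-time sketch of a single run. It is not the authors' simulator: the function `simulate_ws` and its parameters `W` (total work), `p` (processors), `lam` (latency) and `seed` are hypothetical names, work is kept in integer units, and a steal is assumed to fail whenever the balanced share `(work - lam) // 2` is not positive.

```python
import random
from collections import defaultdict


def simulate_ws(W, p, lam, seed=0):
    """Return the makespan of one simulated run (unit-speed processors)."""
    if lam < 1:
        raise ValueError("this sketch assumes a latency of at least 1")
    rng = random.Random(seed)
    work = [0] * p
    work[0] = W                    # all work starts on processor 0
    waiting = [False] * p          # True while a steal request is pending
    sending_until = [0] * p        # a victim sends to one thief at a time
    requests = defaultdict(list)   # arrival time -> [(thief, victim)]
    answers = defaultdict(list)    # arrival time -> [(thief, amount)]
    in_transit = 0
    t = 0
    while sum(work) + in_transit > 0:
        # 1. Answers reaching thieves now (amount 0 is a fail response).
        for thief, amount in answers.pop(t, []):
            work[thief] += amount
            in_transit -= amount
            waiting[thief] = False
        # 2. Requests reaching victims now, grouped by victim.
        by_victim = defaultdict(list)
        for thief, victim in requests.pop(t, []):
            by_victim[victim].append(thief)
        for victim, thieves in by_victim.items():
            share = (work[victim] - lam) // 2
            can_send = share > 0 and sending_until[victim] <= t
            chosen = rng.choice(thieves) if can_send else None
            for thief in thieves:
                # Only the chosen thief gets work; the others get a fail.
                amount = share if thief == chosen else 0
                answers[t + lam].append((thief, amount))
            if can_send:
                work[victim] -= share
                in_transit += share
                sending_until[victim] = t + lam
        # 3. Idle processors with no pending request pick a random victim.
        for i in range(p):
            if p > 1 and work[i] == 0 and not waiting[i]:
                victim = rng.choice([j for j in range(p) if j != i])
                requests[t + lam].append((i, victim))
                waiting[i] = True
        # 4. Every processor holding work executes one unit.
        for i in range(p):
            if work[i] > 0:
                work[i] -= 1
        t += 1
    return t


print(simulate_ws(W=10000, p=32, lam=5, seed=42))
```

Averaging `simulate_ws` over many seeds for fixed `W`, `p` and `lam` gives an empirical estimate of the expected Makespan that can be compared with the lower bound `W / p`.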
5.2. Validation of the bound
As seen before, the bound on the expected Makespan consists of two terms: the first term is the ratio , which does not depend on the configuration and the algorithm, and the second term represents the overhead related to steal requests. Our analysis bounds the second term to derive our bound on the Makespan.
Therefore, to analyze its validity, we define the overhead ratio as the ratio between the second term of our theoretical bound () and the execution time simulated minus the ratio . We study this overhead ratio under different parameters , and .
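As a small illustration of this definition, the sketch below computes the overhead ratio for one run. The function name and parameter names are hypothetical; `bound_overhead` stands for the second term of the theoretical bound and must be supplied by the caller, and the numbers in the example are made up for illustration only.

```python
def overhead_ratio(bound_overhead, simulated_makespan, W, p):
    """Ratio of the theoretical overhead term to the simulated overhead.

    The simulated overhead is the simulated makespan minus the ideal
    load-balanced time W / p.
    """
    simulated_overhead = simulated_makespan - W / p
    if simulated_overhead <= 0:
        raise ValueError("simulated makespan must exceed W / p")
    return bound_overhead / simulated_overhead


# Illustrative values only: W / p = 125, so the simulated overhead is 100.
print(overhead_ratio(bound_overhead=500.0, simulated_makespan=225.0, W=1000, p=8))
```

A ratio above 1 means the theoretical overhead term is pessimistic with respect to the simulated run.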
Figure 1 plots the overhead ratio according to each couple , for a latency of (1000 runs). Similar results have been observed with all values of latency used. The x-axis covers all values of and intervals and the y-axis shows the overhead ratio. We use here a box plot to present the results. Box plots give a good overview and a numerical summary of a data set. The "interquartile range" (IQR) in the middle part of the plot represents the middle quartiles, where 50% of the results lie. The line inside the box presents the median. The whiskers on either side of the IQR represent the lowest and highest quartiles of the data. The ends of the whiskers represent the maximum and minimum of the data, and the individual points beyond the whiskers represent outliers.
We observe that our bound is systematically about 4 to 5.5 times greater than the one computed by simulation (depending on the range of parameters). The ratio between the two bounds decreases with the number of processors but seems fairly independent of .
5.3. Discussion
The challenge of this paper is to analyze WS algorithm with an explicit latency. We presented a new analysis which derives a bound on the expected Makespan for a given , and . It shows that the expected Makespan is bounded by plus an additional term bounded by times . As observed in Figure 1, the constant is about four to five times larger than the one observed by simulation. A more precise fitting based on simulation results leads to the expression . We briefly review here the main steps of the proof involving approximations and conclude with a simple fit of results on the best constant matching the expected Makespan.
First, recall that our analysis relies on , the number of steal requests arriving at each time interval. The exact values of all are unknown and we rely on worst case majorations. In the proof of Theorem 4.1 we define as the maximal value of . The function is about for small values of and tends to when the number of work requests is maximal. In the simulation, we observe that the number of work requests is often low. Therefore, our definition of probably contributes a factor to the overhead ratio.
The second approximation is that we computed the diminution of the potential using the maximum diminution in all cases described in the proof of Lemma 4.2. This analysis could be improved by taking a more complex potential function, but this would lead to much harder computations for marginal improvements.
The third approximation is that we assumed that we do not know when exactly steal requests arrive in the interval. We therefore always took the worst case (arrivals at the end of the intervals). We believe that this approximation has only a minor effect on the overhead ratio since it mainly impacts the potential diminution obtained from computations (as opposed to stealing).
Finally, the value of depends on . To achieve a constant bound, we consider again the worst case obtained as tends to infinity. This explains the fact that the overhead ratio increases slightly when the number of processors decreases ( for processors and for processors).
6. Conclusion
We presented in this paper a new analysis of the Work Stealing algorithm where each communication has a latency of . Our main result was to show that the expected Makespan of a load of on a cluster of processors is bounded by .
We based our analysis on potential functions to bound the expected number of steal requests. We thereby derived a theoretical bound on the expected Makespan. We also extended this analysis one step further, by providing a bound on the probability of exceeding the bound on the Makespan.
To assess the tightness of this analysis we developed an ad hoc simulator. By comparing the theoretical bound and the experimental results, we showed that our bound is realistic. We observed moreover that our bound (established on worst case analysis) is about 5 times greater than the experimental results and that this ratio is stable for all the tested values.
This work can serve as the basis of future studies on more complex hierarchical topologies where communications matter. As such, it is important, as it gives an understanding of the behavior of various Work Stealing implementations in a base setting.
References
 Agrawal et al. (2010) K. Agrawal, C. E. Leiserson, and J. Sukha. 2010. Executing task graphs using work-stealing. In 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS). 1–12. https://doi.org/10.1109/IPDPS.2010.5470403
 Arafat et al. (2016) Humayun Arafat, James Dinan, Sriram Krishnamoorthy, Pavan Balaji, and P. Sadayappan. 2016. Work Stealing for GPU-accelerated Parallel Programs in a Global Address Space Framework. Concurr. Comput.: Pract. Exper. 28, 13 (Sept. 2016), 3637–3654. https://doi.org/10.1002/cpe.3747
 Arora et al. (2001) Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. 2001. Thread Scheduling for Multiprogrammed Multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta. 119–129.
 Bender and Rabin (2002) Michael A. Bender and Michael O. Rabin. 2002. Online Scheduling of Parallel Programs on Heterogeneous Systems with Applications to Cilk. Theory of Computing Systems 35, 3 (2002), 289–304. https://doi.org/10.1007/s0022400210555
 Berenbrink et al. (2003) Petra Berenbrink, Tom Friedetzky, and Leslie Ann Goldberg. 2003. The Natural Work-Stealing Algorithm is Stable. SIAM J. Comput. 32, 5 (2003), 1260–1279. https://doi.org/10.1137/S0097539701399551
 Blumofe and Leiserson (1999) Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling Multithreaded Computations by Work Stealing. J. ACM 46, 5 (Sept. 1999), 720–748. https://doi.org/10.1145/324133.324234
 Cosnard and Trystram (1995) Michel Cosnard and Denis Trystram. 1995. Parallel algorithms and architectures. International Thomson.
 Dinan et al. (2009) James Dinan, D. Brian Larkins, P. Sadayappan, Sriram Krishnamoorthy, and Jarek Nieplocha. 2009. Scalable Work Stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’09). ACM, New York, NY, USA, Article 53, 11 pages. https://doi.org/10.1145/1654059.1654113
 Durrett (1996) Richard Durrett. 1996. Probability: theory and examples. (1996).
 Frigo et al. (1998) Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The Implementation of the Cilk-5 Multithreaded Language. SIGPLAN Not. 33, 5 (May 1998), 212–223. https://doi.org/10.1145/277652.277725
 Gast and Bruno (2010) Nicolas Gast and Bruno Gaujal. 2010. A Mean Field Model of Work Stealing in Large-scale Systems. SIGMETRICS Perform. Eval. Rev. 38, 1 (June 2010), 13–24. https://doi.org/10.1145/1811099.1811042
 Gautier et al. (2007) Thierry Gautier, Xavier Besseron, and Laurent Pigeon. 2007. KAAPI: A Thread Scheduling Runtime System for Data Flow Computations on Cluster of Multiprocessors. In Proceedings of the 2007 International Workshop on Parallel Symbolic Computation (PASCO ’07). ACM, New York, NY, USA, 15–23. https://doi.org/10.1145/1278177.1278182
 Guo et al. (2010) Y. Guo, J. Zhao, V. Cave, and V. Sarkar. 2010. SLAW: A scalable locality-aware adaptive work-stealing scheduler. In 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS). 1–12. https://doi.org/10.1109/IPDPS.2010.5470425
 Hamid et al. (2015a) N. Hamid, R. Walters, and G. Wills. 2015a. Simulation and Mathematical Analysis of Multi-core Cluster Architecture. In 2015 17th UKSim-AMSS International Conference on Modelling and Simulation (UKSim). 476–481. https://doi.org/10.1109/UKSim.2015.54
 Hamid et al. (2015b) Norhazlina Hamid, Robert John Walters, and Gary Brian Wills. 2015b. An analytical model of multicore multicluster architecture (MCMCA). (2015), 12 pages. http://eprints.soton.ac.uk/374744/
 jai Min et al. (2011) Seung-Jai Min, Costin Iancu, and Katherine Yelick. 2011. Hierarchical Work Stealing on Manycore Clusters. In Fifth Conference on Partitioned Global Address Space Programming Models. Galveston Island.
 Leiserson (2009) Charles E. Leiserson. 2009. The Cilk++ Concurrency Platform. In Proceedings of the 46th Annual Design Automation Conference (DAC ’09). ACM, New York, NY, USA, 522–527. https://doi.org/10.1145/1629911.1630048
 Li et al. (2013) S. Li, J. Hu, X. Cheng, and C. Zhao. 2013. Asynchronous Work Stealing on Distributed Memory Systems. In 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. 198–202. https://doi.org/10.1109/PDP.2013.35
 Lüling and Monien (1993) Reinhard Lüling and Burkhard Monien. 1993. A Dynamic Distributed Load Balancing Algorithm with Provable Good Performance. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA ’93). ACM, New York, NY, USA, 164–172. https://doi.org/10.1145/165231.165252
 Mitzenmacher (1998) Michael Mitzenmacher. 1998. Analyses of Load Stealing Models Based on Differential Equations. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA ’98). ACM, New York, NY, USA, 212–221. https://doi.org/10.1145/277651.277687
 Muller and Acar (2016) Stefan K. Muller and Umut A. Acar. 2016. Latency-Hiding Work Stealing: Scheduling Interacting Parallel Computations with Work Stealing. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’16). ACM, New York, NY, USA, 71–82. https://doi.org/10.1145/2935764.2935793
 Perarnau and Sato (2014) S. Perarnau and M. Sato. 2014. Victim Selection and Distributed Work Stealing Performance: A Case Study. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 659–668. https://doi.org/10.1109/IPDPS.2014.74
 Robison et al. (2008) A. Robison, M. Voss, and A. Kukanov. 2008. Optimization via Reflection on Work Stealing in TBB. In 2008 IEEE International Symposium on Parallel and Distributed Processing. 1–8. https://doi.org/10.1109/IPDPS.2008.4536188
 Rudolph et al. (1991) Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal. 1991. A Simple Load Balancing Scheme for Task Allocation in Parallel Machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA ’91). ACM, New York, NY, USA, 237–245. https://doi.org/10.1145/113379.113401
 Sanders (1999) Peter Sanders. 1999. Asynchronous Random Polling Dynamic Load Balancing. Springer Berlin Heidelberg, Berlin, Heidelberg, 37–48. https://doi.org/10.1007/3540466320_5
 Tchiboukdjian et al. (2013) Marc Tchiboukdjian, Nicolas Gast, and Denis Trystram. 2013. Decentralized list scheduling. Annals of Operations Research 207, 1 (2013), 237–259. https://doi.org/10.1007/s1047901211497
 Yang and He (2017) Jixiang Yang and Qingbi He. 2017. Scheduling Parallel Computations by Work Stealing: A Survey. International Journal of Parallel Programming (2017), 1–25. https://doi.org/10.1007/s1076601604848