Co-Scheduling Algorithms for High-Throughput Workload Execution
Abstract
This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several applications concurrently. We partition the original application set into a series of packs, which are executed one by one. A pack comprises several applications, each of them with an assigned number of processors, with the constraint that the total number of processors assigned within a pack does not exceed the maximum number of available processors. The objective is to determine a partition into packs, and an assignment of processors to applications, that minimize the sum of the execution times of the packs. We thoroughly study the complexity of this optimization problem, and propose several heuristics that exhibit very good performance on a variety of workloads, whose application execution times model profiles of parallel scientific codes. We show that co-scheduling leads to faster workload completion time and to faster response times on average (hence increasing system throughput and saving energy), for significant benefits over traditional scheduling from both the user and system perspectives.
1 Introduction
The execution time of many high-performance computing applications can be significantly reduced when using a large number of processors. Indeed, parallel multicore platforms enable the fast processing of very large jobs, thereby rendering the solution of challenging scientific problems more tractable. However, monopolizing all computing resources to accelerate the processing of a single application is very likely to lead to inefficient resource usage. This is because the typical speedup profile of most applications is sublinear and even reaches a threshold: when the number of processors increases, the execution time first decreases, but not linearly, because it suffers from the overhead due to communications and load imbalance; at some point, adding more resources does not lead to any significant benefit.
In this paper, we consider a pool of several applications that have been submitted for execution. Rather than executing each of them in sequence, with the maximum number of available resources, we introduce co-scheduling algorithms that execute several applications concurrently. We do increase the individual execution time of each application, but (i) we improve the efficiency of the parallelization, because each application is scheduled on fewer resources; (ii) the total execution time will be much shorter; and (iii) the average response time will also be shorter. In other words, co-scheduling increases platform yield (thereby saving energy) without sacrificing response time.
In operating high-performance computing systems, the costs of energy consumption can greatly impact the total costs of ownership. Consequently, there is a move away from a focus on peak performance (or speed) and towards improving energy efficiency [12, 20]. Recent results on improving the energy efficiency of workloads can be broadly classified into approaches that focus on dynamic voltage and frequency scaling, or alternatively, on task aggregation or co-scheduling. In both types of approaches, the individual execution time of an application may increase, but there can be considerable energy savings in processing a workload.
More formally, we deal with the following problem: given (i) a distributed-memory platform with p processors, and (ii) n applications, or tasks, T_1, …, T_n, with their execution profiles (t_{i,j} is the execution time of T_i with j processors), what is the best way to co-schedule them, i.e., to partition them into packs, so as to minimize the sum of the execution times over all packs. Here a pack is a subset of tasks, together with a processor assignment for each task. The constraint is that the total number of resources assigned to the pack does not exceed p, and the execution time of the pack is the longest execution time of a task within that pack. The objective of this paper is to study this co-scheduling problem, both theoretically and experimentally. We aim at demonstrating the gain that can be achieved through co-scheduling, both on platform yield and response time, using a set of real-life application profiles.
On the theoretical side, to the best of our knowledge, the complexity of the co-scheduling problem has never been investigated, except for the simple case when one enforces that each pack comprises at most two tasks [21]. While the problem has polynomial complexity for the latter restriction (with at most two tasks per pack), we show that it is NP-complete when assuming at most k tasks per pack, for any k ≥ 3. Note that the instance with k = n is the general, unconstrained, instance of the co-scheduling problem. We also propose an approximation algorithm for the general instance. In addition, we propose an optimal processor assignment procedure when the tasks that form a pack are given. We use these two results to derive efficient heuristics. Finally, we discuss how to optimally solve small-size instances, either through enumerating partitions, or through an integer linear program: this has a potentially exponential cost, but allows us to assess the absolute quality of the heuristics that we have designed. Altogether, all these results lay solid theoretical foundations for the problem.
On the experimental side, we study the performance of the heuristics on a variety of workloads, whose application execution times model profiles of parallel scientific codes. We focus on three criteria: (i) cost of the co-schedule, i.e., total execution time; (ii) packing ratio, which evaluates the idle time of processors during execution; and (iii) response time compared to a fully parallel execution of each task, starting with the shortest task. The proposed heuristics show very good performance within a short running time, hence validating the approach.
The paper is organized as follows. We discuss related work in Section 2. The problem is then formally defined in Section 3. Theoretical results are presented in Section 4, exhibiting the problem complexity, discussing subproblems and optimal solutions, and providing an approximation algorithm. Building upon these results, several polynomial-time heuristics are described in Section 5, and they are thoroughly evaluated in Section 6. Finally, we conclude and discuss future work in Section 7.
2 Related work
In this paper, we deal with pack scheduling for parallel tasks, aiming at makespan minimization (recall that the makespan is the total execution time). The corresponding problem with sequential tasks (tasks that execute on a single processor) is easy to solve for the makespan minimization objective: simply make a pack out of the p largest tasks, and proceed likewise while there remain tasks. Note that the pack scheduling problem with sequential tasks has been widely studied for other objective functions; see Brucker et al. [4] for various job cost functions, and Potts and Kovalyov [18] for a survey. Back to the problem with sequential tasks and the makespan objective, Koole and Righter in [13] deal with the case where the execution time of each task is unknown but defined by a probabilistic distribution. They showed counter-intuitive properties that enabled them to derive an algorithm computing the optimal policy when there are two processors, improving the result of Deb and Serfozo [7], who considered the stochastic problem with identical jobs.
To the best of our knowledge, the problem with parallel tasks has not been studied as such. However, it was introduced by Dutot et al. in [8] as a moldable-by-phase model to approximate the moldable problem. The moldable task model is similar to the pack-scheduling model, but without the additional constraint (pack constraint) that the execution of new tasks cannot start before all tasks in the current pack are completed. Dutot et al. in [8] provide an optimal polynomial-time solution for the problem of pack-scheduling identical independent tasks, using a dynamic-programming algorithm. This is the only instance of pack-scheduling with parallel tasks that we found in the literature.
A closely related problem is the rectangle packing problem, or 2D strip-packing. Given a set of rectangles of different sizes, the problem consists in packing these rectangles into a strip of fixed width. If one sees the width as the number of processors, and the height as the maximum makespan allowed, this problem is identical to the variant of our problem where the number of processors is pre-assigned to each task: each rectangle to be packed can be seen as a task to be computed on a fixed number of processors. In [22], Turek et al. approximated the rectangle packing problem using shelf-based solutions: the rectangles are assigned to shelves, whose placements correspond to constant time values. All rectangles assigned to a shelf have equal starting times, and the next shelf is placed on top of the previous shelf. This is exactly what we ask in our pack-scheduling model. This problem is also called level packing in some papers, and we refer the reader to a recent survey on 2D-packing algorithms by Lodi et al. [16]. In particular, Coffman et al. in [6] show that level-packing algorithms can reach a constant-factor approximation for the 2D strip-packing problem (2.7 when the height of each rectangle is bounded by 1). Unfortunately, all these algorithms consider the number of processors (or width of the rectangles) to be already fixed for each task, hence they cannot be used directly in our problem, for which a key decision is the number of processors assigned to each task.
In practice, pack scheduling is really useful, as shown by recent results. Li et al. [15] propose a framework to predict the energy and performance impacts of power-aware MPI task aggregation. Frachtenberg et al. [9] show that system utilization can be improved through their schemes to co-schedule jobs based on their load-balancing requirements and inter-processor communication patterns. In our earlier work [21], we had shown that even when the pack size is limited to two tasks, co-scheduling based on speedup profiles can lead to faster workload completion and corresponding savings in system energy.
Several recent publications [2, 5, 11] consider co-scheduling at a single multicore node, when contention for resources by co-scheduled tasks leads to complex tradeoffs between energy and performance measures. Chandra et al. [5] predict and utilize inter-thread cache contention at a multicore node in order to improve performance. Hankendi and Coskun [11] show that there can be measurable gains in energy per unit of work through the application of their multi-level co-scheduling technique at runtime, which is based on classifying tasks according to specific performance measures. Bhaduria and McKee [2] consider local search heuristics to co-schedule tasks in a resource-aware manner at a multicore node to achieve significant gains in thread throughput per watt.
These publications demonstrate that complex tradeoffs cannot be captured through the use of the speedup measure alone, without significant additional measurements to capture performance variations from cross-application interference at a multicore node. Additionally, as shown in our earlier work [21], we expect significant benefits even when we aggregate only across multicore nodes, because speedups suffer from the longer latencies of data transfer across nodes. We can therefore project savings in energy as being commensurate with the savings in the time to complete a workload through co-scheduling. Hence, we only test configurations where no more than a single application can be scheduled on a multicore node.
3 Problem definition
The application consists of n independent tasks T_1, …, T_n. The target execution platform consists of p identical processors, and each task T_i can be assigned an arbitrary number of processors j, where 1 ≤ j ≤ p. The objective is to minimize the total execution time by co-scheduling several tasks onto the p resources. Note that the approach is agnostic of the granularity of each processor, which can be either a single CPU or a multicore node.
Speedup profiles – Let t_{i,j} be the execution time of task T_i with j processors, and let w_{i,j} = j × t_{i,j} be the corresponding work. We assume the following, for 1 ≤ i ≤ n and 1 ≤ j < p:

t_{i,j+1} ≤ t_{i,j}   (1)

w_{i,j} ≤ w_{i,j+1}   (2)
Equation (1) implies that the execution time is a non-increasing function of the number of processors. Equation (2) states that efficiency decreases with the number of enrolled processors: in other words, parallelization has a cost! As a side note, we observe that these requirements make good sense in practice: many scientific tasks T_i are such that t_{i,j} first decreases (due to load-balancing) and then increases (due to communication overhead), reaching a minimum for j = j_0; we can always let t_{i,j} = t_{i,j_0} for j ≥ j_0 by never actually using more than j_0 processors for T_i.
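As an illustration, an Amdahl-style profile satisfies both requirements. The sketch below uses made-up parameter values (sequential fraction `alpha`, single-processor time `t1`) purely to make the two assumptions concrete; any profile with non-increasing time and non-decreasing work fits the model.

```python
# Hypothetical Amdahl-style profile: t(j) = t1 * (alpha + (1 - alpha) / j),
# where alpha is the sequential fraction of the task (made-up values below).
def exec_time(t1, alpha, j):
    """Execution time on j processors under Amdahl's law."""
    return t1 * (alpha + (1.0 - alpha) / j)

def work(t1, alpha, j):
    """Work on j processors: j * t(j) = t1 * (alpha * j + 1 - alpha)."""
    return j * exec_time(t1, alpha, j)

t1, alpha, p = 100.0, 0.05, 64
times = [exec_time(t1, alpha, j) for j in range(1, p + 1)]
works = [work(t1, alpha, j) for j in range(1, p + 1)]
# Equation (1): the execution time is non-increasing with j.
assert all(times[j] >= times[j + 1] for j in range(p - 1))
# Equation (2): the work is non-decreasing with j (parallelization has a cost).
assert all(works[j] <= works[j + 1] for j in range(p - 1))
```

Here the work t1 · (α · j + 1 − α) grows linearly with j, so Equation (2) holds with strict inequality as soon as α > 0.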
Co-schedules – A co-schedule partitions the n tasks into groups (called packs), so that (i) all tasks from a given pack start their execution at the same time; and (ii) two tasks from different packs have disjoint execution intervals. See Figure 1 for an example. The execution time, or cost, of a pack is the maximal execution time of a task in that pack, and the cost of a co-schedule is the sum of the costs of its packs.
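In code, the cost of a co-schedule follows directly from this definition. The execution-time values below are hypothetical, and indexing follows the convention that `exec_times[i][j]` is the time of task i on j processors:

```python
def pack_cost(pack, exec_times, p):
    """Cost of one pack: the longest execution time among its tasks.
    pack is a list of (task, nb_procs) pairs; the total number of
    processors used must not exceed p."""
    assert sum(j for _, j in pack) <= p
    return max(exec_times[i][j] for i, j in pack)

def coschedule_cost(packs, exec_times, p):
    """Packs run one after the other, so their costs add up."""
    return sum(pack_cost(pack, exec_times, p) for pack in packs)

# Hypothetical profiles: exec_times[i][j], index 0 unused.
exec_times = [
    [None, 10, 6, 5, 4],  # task 0
    [None, 6, 4, 3, 3],   # task 1
    [None, 8, 5, 4, 3],   # task 2
]
packs = [[(0, 2), (1, 2)], [(2, 4)]]          # two packs on p = 4 processors
print(coschedule_cost(packs, exec_times, 4))  # max(6, 4) + 3 = 9
```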
k-in-p-CoSchedule optimization problem – Given a fixed constant k ≤ p, find a co-schedule with at most k tasks per pack that minimizes the execution time. The most general problem is when k = p, but in some frameworks we may have an upper bound k < p on the maximum number of tasks within each pack.
4 Theoretical results
First we discuss the complexity of the problem in Section 4.1, by exhibiting polynomial and NP-complete instances. Next we discuss how to optimally schedule a set of tasks in a single pack (Section 4.2). Then we explain how to compute the optimal solution (at a possibly exponential cost) in Section 4.3. Finally, we provide an approximation algorithm in Section 4.4.
4.1 Complexity
Theorem 1.
The 1-in-p-CoSchedule and 2-in-p-CoSchedule problems can both be solved in polynomial time.
Proof.
This result is obvious for 1-in-p-CoSchedule: each task is assigned exactly p processors (see Equation (1)), and the minimum execution time is ∑_{i=1}^{n} t_{i,p}.
The proof is more involved for 2-in-p-CoSchedule, and we start with the 2-in-2-CoSchedule problem to get an intuition. Consider the weighted undirected graph G = (V, E), where V = {v_1, …, v_n}, each vertex v_i corresponding to a task T_i. The edge set E is the following: (i) for all i, there is a loop on v_i of weight t_{i,2}; (ii) for all i < i′, there is an edge between v_i and v_{i′} of weight max(t_{i,1}, t_{i′,1}). Finding a perfect matching of minimal weight in G leads to the optimal solution to 2-in-2-CoSchedule, which can thus be solved in polynomial time.
For the 2-in-p-CoSchedule problem, the proof is similar; the only difference lies in the construction of the edge set E: (i) for all i, there is a loop on v_i of weight t_{i,p}; (ii) for all i < i′, there is an edge between v_i and v_{i′} of weight min_{1 ≤ q < p} max(t_{i,q}, t_{i′,p−q}). Again, a perfect matching of minimal weight in G gives the optimal solution to 2-in-p-CoSchedule. We conclude that the 2-in-p-CoSchedule problem can be solved in polynomial time. ∎
Theorem 2.
When k ≥ 3, the k-in-p-CoSchedule problem is strongly NP-complete.
Proof.
We prove the NP-completeness of the decision problem associated with k-in-p-CoSchedule: given n independent tasks, p processors, a set of execution times t_{i,j} for 1 ≤ i ≤ n and 1 ≤ j ≤ p satisfying Equations (1) and (2), a fixed constant k ≥ 3 and a deadline D, can we find a co-schedule with at most k tasks per pack whose execution time does not exceed D? The problem is obviously in NP: if we have the composition of every pack, and for each task in a pack, the number of processors onto which it is assigned, we can verify in polynomial time: (i) that it is indeed a pack schedule; (ii) that the execution time is smaller than the given deadline.
We first prove the strong completeness of 3-in-p-CoSchedule. We use a reduction from 3-Partition. Consider an arbitrary instance I_1 of 3-Partition: given an integer B and 3m integers a_1, …, a_3m, can we partition the 3m integers into m triplets, each of sum B? We can assume that ∑_{i=1}^{3m} a_i = mB, otherwise I_1 has no solution. The 3-Partition problem is NP-hard in the strong sense [10], which implies that we can encode all integers (a_1, …, a_3m, B) in unary. We build the following instance I_2 of 3-in-p-CoSchedule: the number of processors is p = B, the deadline is D = m, and there are n = 3m tasks T_i, with the following execution times: for all i and j, if j < a_i then t_{i,j} = 1 + 1/B, otherwise t_{i,j} = 1. It is easy to check that Equations (1) and (2) are both satisfied. For the latter, since there are only two possible execution times for each task, we only need to check Equation (2) for j = a_i − 1, and we do obtain that (a_i − 1)(1 + 1/B) ≤ a_i, since a_i ≤ B. Finally, I_2 has a size polynomial in the size of I_1, even if we write all instance parameters in unary: the execution times take only two values, and the thresholds a_i have the same size as in I_1.
We now prove that I_1 has a solution if and only if I_2 does. Assume first that I_1 has a solution. For each triplet (a_{i_1}, a_{i_2}, a_{i_3}) of I_1, we create a pack with the three tasks T_{i_1}, T_{i_2}, T_{i_3}, where T_{i_1} is scheduled on a_{i_1} processors, T_{i_2} on a_{i_2} processors, and T_{i_3} on a_{i_3} processors. By definition, we have a_{i_1} + a_{i_2} + a_{i_3} = B, and the execution time of this pack is 1. We do this for the m triplets, which gives a valid co-schedule whose total execution time is m. Hence we have a solution to I_2.
Assume now that I_2 has a solution. The minimum execution time for any pack is 1 (since it is the minimum execution time of any task, and a pack cannot be empty). Hence the solution cannot have more than m packs. Because there are 3m tasks and the number of tasks in a pack is limited to three, there are exactly m packs, each with exactly 3 tasks, and furthermore all these packs have an execution time of 1 (otherwise the deadline D = m is not matched). If there were a pack (T_{i_1}, T_{i_2}, T_{i_3}) such that a_{i_1} + a_{i_2} + a_{i_3} > B, then one of the three tasks, say T_{i_1}, would have to use fewer than a_{i_1} processors, hence would have an execution time greater than 1. Therefore, for each pack, we have a_{i_1} + a_{i_2} + a_{i_3} ≤ B. The fact that this inequality is an equality for all packs follows from the fact that ∑_{i=1}^{3m} a_i = mB. Finally, we conclude by saying that the set of triplets (a_{i_1}, a_{i_2}, a_{i_3}), one per pack, is a solution to I_1.
The final step is to prove the completeness of k-in-p-CoSchedule for a given k > 3. We perform a similar reduction from the same instance I_1 of 3-Partition. We construct the instance I_2 of k-in-p-CoSchedule where the number of processors is p = B + (k − 3)(B + 1) and the deadline is D = m. There are 3m tasks with the same execution times as before (for 1 ≤ i ≤ 3m, if j < a_i then t_{i,j} = 1 + 1/B, otherwise t_{i,j} = 1), and also (k − 3)m new identical tasks A_1, …, A_{(k−3)m} such that, for all j, if j < B + 1 then t_{A,j} = 1 + 1/B, otherwise t_{A,j} = 1. It is easy to check that Equations (1) and (2) are also fulfilled for the new tasks. If I_1 has a solution, we construct the solution to I_2 similarly to the previous reduction, and we add to each pack k − 3 of the new tasks, each assigned to B + 1 processors. This solution has an execution time exactly equal to m. Conversely, if I_2 has a solution, we can verify that there are exactly m packs (there are km tasks and each pack has an execution time at least equal to 1). Then we can verify that there are at most k − 3 tasks A per pack, since there are exactly B + (k − 3)(B + 1) processors. Otherwise, if there were k − 2 (or more) such tasks in a pack, then one of them would be scheduled on fewer than B + 1 processors, and the execution time of the pack would be greater than 1. Finally, we can see that in I_2, each pack is composed of k − 3 tasks A, scheduled on at least B + 1 processors each, and that there remain m triplets of tasks T_i, scheduled on at most B processors. The end of the proof is identical to the reduction in the case k = 3. ∎
Note that the 3-in-p-CoSchedule problem is NP-complete when p is part of the problem instance, while the 2-in-p-CoSchedule problem can be solved in polynomial time; hence the 3-in-3-CoSchedule problem is the simplest problem whose complexity remains open.
4.2 Scheduling a pack of tasks
In this section, we discuss how to optimally schedule a set of k tasks in a single pack: the k tasks T_1, …, T_k are given, and we search for an assignment function σ such that ∑_{i=1}^{k} σ(i) ≤ p, where σ(i) is the number of processors assigned to task T_i. Such a schedule is called a 1-pack-schedule, and its cost is max_{1 ≤ i ≤ k} t_{i,σ(i)}. In Algorithm 4.2 below, we use the convention t_{i,0} = +∞ for 1 ≤ i ≤ k:
Algorithm 4.2: Optimal-1-pack-schedule(T_1, …, T_k)
begin
  for i = 1 to k do σ(i) ← 1
  Let L be the list of tasks sorted in non-increasing values of t_{i,σ(i)}
  n_avail ← p − k
  while n_avail > 0 do
    Let T_i be the first task of L; remove it from L
    σ(i) ← σ(i) + 1; n_avail ← n_avail − 1
    Insert T_i in L according to its value t_{i,σ(i)}
  return σ
end
Theorem 3.
Given k tasks to be scheduled on p processors in a single pack, Algorithm 4.2 finds a 1-pack-schedule of minimum cost in time O(p log k).
In this greedy algorithm, we first assign one processor to each task; then, while there are processors that are not processing any task, we select the task with the longest execution time and assign an extra processor to it. Algorithm 4.2 performs p − k iterations to assign the extra processors. We denote by σ^(q) the value of the assignment function σ at the end of iteration q. For convenience, we let σ^(0)(i) = 1 for 1 ≤ i ≤ k. We start with the following lemma:
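A minimal sketch of this greedy procedure, using a max-heap keyed on the current execution times; the profile values below are hypothetical and only serve to exercise the algorithm:

```python
import heapq

def schedule_pack(exec_times, p):
    """Greedy 1-pack-schedule: one processor per task, then repeatedly
    give one more processor to the task with the longest current time.
    exec_times[i][j] = time of task i on j processors (index 0 unused)."""
    n = len(exec_times)
    assert n <= p, "a pack cannot contain more tasks than processors"
    sigma = [1] * n
    # max-heap on current execution times (negated for heapq's min-heap)
    heap = [(-exec_times[i][1], i) for i in range(n)]
    heapq.heapify(heap)
    for _ in range(p - n):          # distribute the p - n extra processors
        _, i = heapq.heappop(heap)  # task with the longest execution time
        sigma[i] += 1
        heapq.heappush(heap, (-exec_times[i][sigma[i]], i))
    cost = -heap[0][0]              # cost of the pack = longest time
    return sigma, cost

exec_times = [
    [None, 10, 6, 5, 4],  # task 0 on 1..4 processors (hypothetical)
    [None, 6, 4, 3, 3],   # task 1
    [None, 8, 5, 4, 3],   # task 2
]
print(schedule_pack(exec_times, 4))  # ([2, 1, 1], 8)
```

Each of the p − k iterations costs O(log k), matching the complexity analysis of Theorem 3.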
Lemma: At the end of iteration q of Algorithm 4.2, let T_i be the first task of the sorted list, i.e., the task with the longest execution time. Then, for all j, t_{i,σ^(q)(i)} ≤ t_{j,σ^(q)(j)−1}.
Proof.
Let T_i be the task with the longest execution time at the end of iteration q. For tasks T_j such that σ^(q)(j) = 1, the result is obvious since t_{j,0} = +∞. Let us consider any task T_j such that σ^(q)(j) > 1. Let q′ ≤ q be the last iteration at which a new processor was assigned to task T_j: σ^(q′)(j) = σ^(q)(j) and σ^(q′−1)(j) = σ^(q)(j) − 1. By definition of iteration q′, task T_j was chosen because t_{j,σ^(q′−1)(j)} was greater than the execution time of any other task, in particular t_{i,σ^(q′−1)(i)}. Also, since we never remove processors from tasks, we have σ^(q′−1)(i) ≤ σ^(q)(i), and hence t_{i,σ^(q)(i)} ≤ t_{i,σ^(q′−1)(i)}. Finally, t_{i,σ^(q)(i)} ≤ t_{i,σ^(q′−1)(i)} ≤ t_{j,σ^(q′−1)(j)} = t_{j,σ^(q)(j)−1}. ∎
We are now ready to prove Theorem 3.
Proof of Theorem 3.
Let σ be the 1-pack-schedule returned by Algorithm 4.2, of cost c(σ), and let T_i be a task such that t_{i,σ(i)} = c(σ). Let σ′ be any 1-pack-schedule, of cost c(σ′). We prove below that c(σ) ≤ c(σ′), hence σ is a 1-pack-schedule of minimum cost:
– If σ′(i) < σ(i), then T_i has fewer processors in σ′ than in σ, hence its execution time is at least as large, and c(σ) = t_{i,σ(i)} ≤ t_{i,σ′(i)} ≤ c(σ′).
– If σ′(i) ≥ σ(i), then there exists a task T_j such that σ′(j) < σ(j) (since the total number of processors is the same in both σ and σ′). We can apply the previous Lemma at the end of the last iteration, where T_i is the task of maximum execution time: t_{i,σ(i)} ≤ t_{j,σ(j)−1} ≤ t_{j,σ′(j)}, and therefore c(σ) ≤ c(σ′).
Finally, the time complexity is obtained as follows: first we sort k elements, in time O(k log k). Then there are p − k iterations, and at each iteration we insert an element into a sorted list of k − 1 elements, which takes O(log k) operations (using a heap as the data structure for L). ∎
Note that it is easy to compute an optimal 1-pack-schedule using a dynamic-programming algorithm: the optimal cost is c(k, p), which we compute using the recurrence formula

c(i, q) = min_{1 ≤ j ≤ q − (i − 1)} max(t_{i,j}, c(i − 1, q − j))

for 1 ≤ i ≤ k and i ≤ q ≤ p, initialized by c(0, q) = 0, and c(i, q) = +∞ whenever q < i. The complexity of this algorithm is O(k p²). However, we can significantly reduce the complexity by using Algorithm 4.2.
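A memoized transcription of this recurrence may look as follows; it is shown only to contrast the quadratic cost in p with the greedy algorithm, and it reuses the same style of hypothetical profiles as before:

```python
import functools

def schedule_pack_dp(exec_times, p):
    """Dynamic program for one pack: c(i, q) is the optimal cost of
    scheduling tasks 0..i on q processors, with
    c(i, q) = min over j of max(exec_times[i][j], c(i-1, q-j)).
    O(k * p^2), versus O(p log k) for the greedy algorithm."""
    @functools.lru_cache(maxsize=None)
    def c(i, q):
        if i < 0:
            return 0.0                # no task left: cost 0
        if q < i + 1:
            return float("inf")       # fewer processors than remaining tasks
        hi = min(q - i, len(exec_times[i]) - 1)
        return min(max(exec_times[i][j], c(i - 1, q - j))
                   for j in range(1, hi + 1))
    return c(len(exec_times) - 1, p)

exec_times = [
    [None, 10, 6, 5, 4],  # hypothetical profiles, index 0 unused
    [None, 6, 4, 3, 3],
    [None, 8, 5, 4, 3],
]
print(schedule_pack_dp(exec_times, 4))  # 8, same optimal cost as the greedy
```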
4.3 Computing the optimal solution
In this section we sketch two methods to find the optimal solution to the general k-in-p-CoSchedule problem. This can be useful to solve some small-size instances, albeit at the price of a cost exponential in the number of tasks n.
The first method is to generate all possible partitions of the tasks into packs. This amounts to computing all partitions of n elements into subsets of cardinality at most k. For a given partition of tasks into packs, we use Algorithm 4.2 to find the optimal processor assignment for each pack, and we can compute the optimal cost of the partition. There remains to take the minimum of these costs over all partitions.
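The enumeration of partitions can be sketched as follows. There are Bell-number-many partitions, so this is only viable for small n; each generated partition would then be filtered (at most k tasks per pack, and at most p tasks per pack so that every task gets a processor) and costed with the optimal 1-pack procedure:

```python
def all_partitions(tasks):
    """Enumerate all partitions of a list of tasks into packs.
    There are Bell(n) of them, hence exponential cost in n."""
    if not tasks:
        yield []
        return
    first, rest = tasks[0], tasks[1:]
    for partition in all_partitions(rest):
        # `first` opens a new pack...
        yield [[first]] + partition
        # ...or joins one of the existing packs.
        for b in range(len(partition)):
            yield partition[:b] + [[first] + partition[b]] + partition[b + 1:]

parts = list(all_partitions([0, 1, 2]))
print(len(parts))  # 5 = Bell(3)
```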
The second method is to cast the problem in terms of an integer linear program:
Theorem 4.
The following integer linear program characterizes the k-in-p-CoSchedule problem, where the unknown variables are the x_{i,j,b}'s (Boolean variables) and the y_b's (rational variables), for 1 ≤ i, b ≤ n and 1 ≤ j ≤ p:
Minimize ∑_{b=1}^{n} y_b, subject to
(i) ∑_{1≤j≤p, 1≤b≤n} x_{i,j,b} = 1, for each task T_i;
(ii) ∑_{1≤i≤n, 1≤j≤p} x_{i,j,b} ≤ k, for each pack b;
(iii) ∑_{1≤i≤n, 1≤j≤p} j · x_{i,j,b} ≤ p, for each pack b;
(iv) x_{i,j,b} · t_{i,j} ≤ y_b, for all i, j, b.   (3)
Proof.
The x_{i,j,b}'s are such that x_{i,j,b} = 1 if and only if task T_i is in pack b and is executed on j processors; y_b is the execution time of pack b. Since there are no more than n packs (at least one task per pack), b ≤ n. The sum ∑_{b} y_b is therefore the total execution time (y_b = 0 if there are no tasks in pack b). Constraint (i) states that each task is assigned to exactly one pack b, and with one number of processors j. Constraint (ii) ensures that there are no more than k tasks in a pack. Constraint (iii) adds up the number of processors in pack b, which should not exceed p. Finally, constraint (iv) computes the cost of each pack. ∎
4.4 Approximation algorithm
In this section we introduce packApprox, a 3-approximation algorithm for the p-in-p-CoSchedule problem. The design principle of packApprox is the following: we start from the assignment where each task is executed on one processor, and use Algorithm 4.4 (Make-pack) to build a first solution. Make-pack is a greedy heuristic that builds a co-schedule when each task is pre-assigned a number of processors for execution. Then we iteratively refine the solution, adding a processor to the task with the longest execution time, and re-executing Make-pack. Here are details on both algorithms:
Algorithm 4.4 (Make-pack). The p-in-p-CoSchedule problem with processor pre-assignments remains strongly NP-complete (use a reduction similar to that in the proof of Theorem 2). We propose a greedy procedure, Make-pack, which is similar to the First Fit Decreasing Height algorithm for strip packing [6]. The output is a co-schedule with at most k tasks per pack, and the complexity is O(n log n) (dominated by sorting).
Algorithm 4.4 (packApprox). We iterate the calls to Make-pack, adding a processor to the task with the longest execution time, until: (i) either the task with the longest execution time is already assigned p processors, or (ii) the sum of the works of all tasks is at least p times the longest execution time. The algorithm returns the minimum cost found during execution. In the simplest version presented here, the complexity is polynomial (in the calls to Make-pack we do not need to re-sort the list, but maintain it sorted instead), and it can be further reduced using standard algorithmic techniques.
Algorithm 4.4: Make-pack(n, p, k, σ)
begin
  Let L be the list of tasks sorted in non-increasing values of execution times t_{i,σ(i)}
  while L is not empty do
    Schedule the first task of L on the first pack with enough available processors and fewer than k tasks;
    create a new pack if no existing pack fits
    Remove this task from L
  return the set of packs
end
Algorithm 4.4: packApprox(T_1, …, T_n)
begin
  COST ← +∞
  for i = 1 to n do σ(i) ← 1
  for q = 0 to n(p − 1) do
    Call Make-pack(n, p, p, σ)
    Let COST_q be the cost of the resulting co-schedule; COST ← min(COST, COST_q)
    Let T_i be one task that maximizes t_{i,σ(i)}
    if (σ(i) = p) or (∑_{j=1}^{n} σ(j) · t_{j,σ(j)} ≥ p · t_{i,σ(i)}) then
      return COST            /* exit loop */
    else
      σ(i) ← σ(i) + 1        /* add a processor to T_i */
  return COST
end
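A compact sketch of both procedures for the case k = p, using hypothetical execution-time profiles; data structures are simplified compared to the complexity-optimized version described above:

```python
def make_pack(alloc, exec_times, p, k):
    """Greedy Make-pack: with alloc[i] processors pre-assigned to task i,
    sort tasks by non-increasing execution time and first-fit them into
    packs holding at most k tasks and at most p processors."""
    order = sorted(range(len(alloc)),
                   key=lambda i: exec_times[i][alloc[i]], reverse=True)
    packs, free = [], []            # free[b] = processors left in pack b
    for i in order:
        for b in range(len(packs)):
            if free[b] >= alloc[i] and len(packs[b]) < k:
                packs[b].append(i)
                free[b] -= alloc[i]
                break
        else:                       # no existing pack fits: open a new one
            packs.append([i])
            free.append(p - alloc[i])
    return packs

def cost(packs, alloc, exec_times):
    """Cost of a co-schedule: sum over packs of the longest task time."""
    return sum(max(exec_times[i][alloc[i]] for i in pack) for pack in packs)

def pack_approx(exec_times, p):
    """packApprox for k = p: start with one processor per task, repeatedly
    give one more to the longest task, re-pack, and keep the best cost."""
    n = len(exec_times)
    alloc, best = [1] * n, float("inf")
    while True:
        best = min(best, cost(make_pack(alloc, exec_times, p, p),
                              alloc, exec_times))
        longest = max(range(n), key=lambda i: exec_times[i][alloc[i]])
        total_work = sum(alloc[i] * exec_times[i][alloc[i]] for i in range(n))
        if alloc[longest] == p or total_work >= p * exec_times[longest][alloc[longest]]:
            return best             # exit conditions (i) and (ii)
        alloc[longest] += 1

exec_times = [
    [None, 10, 6, 5, 4],  # hypothetical profiles, index 0 unused
    [None, 6, 4, 3, 3],
    [None, 8, 5, 4, 3],
]
print(pack_approx(exec_times, 4))  # 8
```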
Theorem 5.
packApprox is a 3-approximation algorithm for the p-in-p-CoSchedule problem.
Proof.
We start with some notations:
– step q denotes the q-th iteration of the main loop of Algorithm packApprox;
– σ^(q) is the allocation function at step q;
– t_max^(q) is the maximum execution time of any task at step q;
– i_q is the index of a task with the longest execution time at step q (ties broken arbitrarily);
– W^(q) = ∑_{i=1}^{n} σ^(q)(i) · t_{i,σ^(q)(i)} is the total work that has to be done at step q;
– COST_q is the result of the scheduling procedure at the end of step q;
– opt denotes an optimal solution, with allocation function σ_o, execution time COST_o, and total work W_o.
Note that there are three different ways to exit algorithm packApprox:
– if we cannot add a processor to the task with the longest execution time, i.e., σ^(q)(i_q) = p;
– if W^(q) ≥ p · t_max^(q), after having computed the execution time for this assignment;
– when each task has been assigned p processors (the last step of the "for" loop: we have then assigned exactly n(p − 1) extra processors, and no task can be assigned more than p processors).
Lemma 1.
At the end of step q, COST_q ≤ 2W^(q)/p + t_max^(q).
Proof.
Consider the packs returned by Make-pack at step q, sorted by non-increasing execution times, B_1, …, B_N (some of the packs may be empty, with an execution time 0). Let us denote, for 1 ≤ b ≤ N:
– T_{i_b} the task with the longest execution time of pack B_b (i.e., the first task scheduled on B_b);
– λ_b the execution time of pack B_b (in particular, λ_b = t_{i_b,σ(i_b)} and λ_1 = t_max^(q));
– W_b the sum of the works of the tasks in pack B_b;
– p̃_b the number of processors still available in pack B_b when T_{i_{b+1}} was scheduled in pack B_{b+1}.
With these notations, COST_q = ∑_{b=1}^{N} λ_b and W^(q) = ∑_{b=1}^{N} W_b. For each pack, note that W_b ≤ p · λ_b, since p · λ_b is the maximum work that can be done on p processors with an execution time of λ_b. Hence, W^(q) ≤ p · COST_q.
In order to bound λ_{b+1}, let us first remark that p̃_b < σ(i_{b+1}): otherwise T_{i_{b+1}} would have been scheduled on pack B_b. Then, we can exhibit a lower bound for W_b, namely W_b ≥ (p − p̃_b) · λ_{b+1}. Indeed, the tasks scheduled in B_b before T_{i_{b+1}} occupy p − p̃_b processors and all have a length greater than or equal to λ_{b+1}, by definition of the non-increasing order. Furthermore, we obviously have W_{b+1} ≥ σ(i_{b+1}) · λ_{b+1} > p̃_b · λ_{b+1} (the work of the first task scheduled in pack B_{b+1}). So finally we have W_b + W_{b+1} ≥ p · λ_{b+1}.
Summing over all b's, we have ∑_{b=1}^{N−1} (W_b + W_{b+1}) ≥ p · ∑_{b=2}^{N} λ_b, hence 2W^(q) ≥ p · ∑_{b=2}^{N} λ_b. Finally, note that λ_1 = t_max^(q), and therefore COST_q = λ_1 + ∑_{b=2}^{N} λ_b ≤ t_max^(q) + 2W^(q)/p. Note that this proof is similar to the one for the Strip-Packing problem in [6]. ∎
Lemma 2.
At each step q, W^(q+1) ≥ W^(q) and t_max^(q+1) ≤ t_max^(q), i.e., the total work is increasing and the maximum execution time is decreasing.
Proof.
Adding a processor to the task with the longest execution time does not increase its execution time, by Equation (1), and does not decrease its work, by Equation (2); all other tasks are unchanged. ∎
Lemma 3.
Given an optimal solution opt, we have t_{i,p} ≤ COST_o for all i, and W_o ≤ p · COST_o.
Proof.
The first inequality is obvious. As for the second one, p · COST_o is the maximum work that can be done on p processors within an execution time of COST_o, hence it must not be smaller than W_o, which is the sum of the works of the tasks with the optimal allocation. ∎
Lemma 4.
For any step such that , then , and .
Proof.
Lemma 5.
For any step such that , then .
Lemma 6.
There exists such that (we let ).
Proof.
We show this result by contradiction. Suppose such does not exist. Then (otherwise would suffice). Let us call the last step of the run of the algorithm. Then by induction we have the following property, (otherwise would exist, hence contradicting our hypothesis). Recall that there are three ways to exit the algorithm, hence three possible definitions for :

, but then we have the same result, i.e., because this is true for all tasks.

, but this is false according to Lemma 5.
We have seen that packApprox could not have terminated at this step; however, since packApprox terminates (after at most n(p − 1) steps), we have a contradiction. Hence we have shown the existence of such a step. ∎
Lemma 7.
.
Proof.
Consider step . If , then at this step, all tasks are scheduled on exactly one processor, and . Therefore, . If , consider step : . From Lemma 4, we have . Furthermore, it is easy to see that since no task other than is modified. We also have the following properties:

;

(by definition of step );

(Lemma 3);

.
The first three properties and Equation (1) allow us to say that . Thanks to the fourth property, . Finally, we have, for all , and therefore by Equation (2). ∎
5 Heuristics
In this section, we describe the heuristics that we use to solve the k-in-p-CoSchedule problem.
RandomPack – In this heuristic, we generate the packs randomly: as long as there remain tasks, randomly choose an integer j between 1 and k, and then randomly select j tasks to form a pack. Once the packs are generated, apply Algorithm 4.2 to optimally schedule each of them.
RandomProc – In this heuristic, we assign the number of processors of each task randomly between 1 and p, then use Algorithm 4.4 (Make-pack) to generate the packs, followed by Algorithm 4.2 on each pack.
A word of caution – We point out that RandomPack and RandomProc are not pure random heuristics, in that they already benefit from the theoretical results of Section 4. A more naive heuristic would pick both a task and a number of processors randomly, and greedily build packs, creating a new one as soon as more than p resources are assigned within the current pack. Here, both RandomPack and RandomProc use the optimal resource allocation strategy (Algorithm 4.2) within a pack; in addition, RandomProc uses an efficient partitioning algorithm (Algorithm 4.4) to create packs when resources are pre-assigned to tasks.
packApprox – This heuristic is an extension of Algorithm 4.4 in Section 4.4 to deal with packs of at most k tasks rather than p: simply call Make-pack with a bound of k tasks per pack instead of p. However, although we keep the same name as in Section 4.4 for simplicity, we point out that it is unknown whether this heuristic is a 3-approximation algorithm for arbitrary k.
packbypack(μ) – The rationale for this heuristic is to create packs that are well-balanced: the difference between the smallest and longest execution times in each pack should be as small as possible. Initially, we assign one processor per task (σ(i) = 1 for 1 ≤ i ≤ n), and tasks are sorted into a list L ordered by non-increasing execution times (t_{i,σ(i)} values). While there remain some tasks in L, let T_i* be the first task of the list, and let t* = t_{i*,σ(i*)}. Let S be the ordered set of tasks T_i such that t_{i,σ(i)} ≥ (1 − μ) × t*: this is the sublist of tasks (including T_i* as its first element) whose execution times are close to the longest execution time t*, and μ is some parameter. Let P_req be the total number of processors requested by the tasks in S. If P_req ≥ p, a new pack is created greedily with the first tasks of S, adding them into the pack while no more than p processors are used and no more than k tasks are in the pack. The corresponding tasks are removed from the list L. Note that T_i* is always inserted in the created pack. Also, if we have σ(i*) = p, then a new pack with only T_i* is created. Otherwise (P_req < p), an additional processor is assigned to the (currently) critical task T_i*, hence σ(i*) ← σ(i*) + 1, and the process iterates after the list L is updated with the insertion of the new value t_{i*,σ(i*)}. Finally, once all packs are created, we apply Algorithm 4.2 within each pack, so as to derive the optimal schedule within each pack.
We have 0 < ε ≤ 1. A small value of ε will lead to balanced packs, but may end up with a single task (using all p processors) per pack. Conversely, a large value of ε will create new packs more easily, i.e., with fewer processors per task. The idea is therefore to call the heuristic with different values of ε, and to select the solution that leads to the best execution time.
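To make the loop structure concrete, here is a minimal Python sketch of the packbypack(ε) heuristic described above; it omits the final per-pack optimization with Algorithm 4.2. All names are ours: t[i][q] is a hypothetical table giving the execution time of task i on q processors, p is the number of processors, and k is the maximum number of tasks per pack.

```python
# Minimal sketch of the packbypack(eps) heuristic (our own naming).
# t[i][q]: hypothetical execution time of task i on q processors;
# p: number of processors; k: maximum number of tasks per pack.
def pack_by_pack(t, p, k, eps):
    n = len(t)
    sigma = {i: 1 for i in range(n)}              # one processor per task
    L = sorted(range(n), key=lambda i: -t[i][sigma[i]])
    packs = []
    while L:
        i_star = L[0]                             # current critical task
        t_star = t[i_star][sigma[i_star]]
        # sublist of tasks whose time is close to the critical time
        L_prime = [i for i in L if t[i][sigma[i]] >= (1 - eps) * t_star]
        p_prime = sum(sigma[i] for i in L_prime)
        if p_prime >= p or sigma[i_star] == p:
            # greedily fill a new pack with the first tasks of L_prime
            pack, used = [], 0
            for i in L_prime:
                if used + sigma[i] <= p and len(pack) < k:
                    pack.append(i)
                    used += sigma[i]
            packs.append([(i, sigma[i]) for i in pack])
            L = [i for i in L if i not in pack]
        else:
            # give one more processor to the critical task, then re-sort
            sigma[i_star] += 1
            L.sort(key=lambda i: -t[i][sigma[i]])
    return packs
```

On a toy instance with perfect speedups, the heuristic first packs the longest tasks together, then grows the shorter remaining task before placing it in its own pack.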
Summary of heuristics– We consider two variants of each random heuristic, with either a single run or nine different runs, the latter hoping to obtain a better solution at the price of a slightly longer execution time. These heuristics are denoted RandomPack1, RandomPack9, RandomProc1 and RandomProc9, respectively. Similarly, for packbypack, we either use a single run with a fixed value of ε (packbypack1), or nine runs with nine different values of ε (packbypack9). Of course, there is only one variant of packApprox, hence leading to seven heuristics overall.
Variants– We have investigated variants of packbypack, trying to improve upon the greedy choice used to create the packs, for instance using a dynamic programming algorithm to minimize processor idle time within a pack. However, this brought very little improvement, at the price of a much higher running time. Additionally, we tried to improve the heuristics with a larger number of runs, both for the random ones and for packbypack, but here again the gain in performance was negligible compared to the increase in running time. Therefore we only present results for these seven heuristics in the following.
6 Experimental Results
In this section, we study the performance of the seven heuristics on workloads of parallel tasks. First we describe the workloads, whose application execution times model profiles of parallel scientific codes. Then we present the measures used to evaluate the quality of the schedules, and finally we discuss the results.
Workloads– WorkloadI corresponds to parallel scientific applications based on VASP [14], ABAQUS [3], LAMMPS [17] and PETSc [1]. The execution times of these applications were observed on a cluster with Intel Nehalem 8core nodes connected by a QDR InfiniBand network. In our terms, we have p processors, where each processor is one 8core node.
WorkloadII is a synthetic test suite that was designed to represent a larger set of scientific applications. It models tasks whose parallel execution time on q cores, for a fixed problem size, is of the form t(q) = (f + (1 − f)/q) · t(1) + γ(q), where f can be interpreted as the inherently sequential fraction of the task, and γ(q) represents overheads related to synchronization and the communication of data. We consider sequential times t(1) of several functional forms of the problem size, scaled by a suitable constant, combined with several values of the serial fraction f and several forms of the overhead γ(q), so as to create a workload of tasks executing on up to the full set of cores.
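As an illustration, the Amdahl-style model above can be sketched as follows; the function and parameter names (parallel_time, t1, f, gamma) are ours, and the logarithmic default overhead is just one plausible choice, not the specific forms used to generate the workload.

```python
import math

# Sketch of the assumed speedup model (our own naming):
# t(q) = (f + (1 - f)/q) * t(1) + gamma(q),
# where f is the sequential fraction and gamma(q) models
# synchronization/communication overhead on q cores.
def parallel_time(t1, f, q, gamma=lambda q: math.log2(q)):
    return (f + (1 - f) / q) * t1 + gamma(q)
```

With zero overhead the model reduces to Amdahl's law, and the overhead term γ(q) is what eventually makes adding cores counterproductive, matching the threshold behavior discussed in the introduction.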
The same process was also used to develop WorkloadIII, our largest synthetic test suite, with more tasks and more cores (and hence more multicore nodes), to study the scalability of our heuristics. For all workloads, we modified the speedup profiles to satisfy Equations (1) and (2).
As discussed in related work (see Section 2) and in [21], and confirmed by power measurements using Watts Up Pro meters, we observed only minor power consumption variations, of less than 5%, when coscheduling was limited to occur across multicore nodes. Therefore, we only test configurations where no more than a single application can be scheduled on a given multicore node comprising 8 cores. Adding a processor to an application that is already assigned some processors then actually means adding 8 new cores (a full multicore node) to its existing cores. Hence a pack using p processors corresponds to the use of at most 8p cores by the applications in that pack. WorkloadsI and II use a fixed number of nodes (and hence of cores), while WorkloadIII scales up to a larger number of nodes and cores.
Methodology for assessing the heuristics– To evaluate the quality of the schedules generated by our heuristics, we consider three measures: Relative cost, Packing ratio, and Relative response time. Recall that the cost of a pack is the maximum execution time of a task in that pack and the cost of a coschedule is the sum of the costs over all its packs.
We define the relative cost as the cost of a given coschedule divided by the cost of a 1pack schedule, i.e., one with each task running at maximum speed on all p processors. For a given coschedule, consider the total work W = Σ_i σ(i) · t_{i,σ(i)}, i.e., the total work performed in the coschedule when the ith task is assigned σ(i) processors. We define the packing ratio as W divided by p times the cost of the coschedule; observe that the packing quality is high when this ratio is close to 1, meaning that there is almost no idle time in the schedule.
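The first two measures can be computed directly from a coschedule. The following sketch uses our own conventions: a schedule is a list of packs, each pack a list of (task, processors) pairs, and t[i][q] is a hypothetical table of execution times.

```python
# Evaluation measures (our own helper names and conventions).
# schedule: list of packs, each a list of (i, q) pairs meaning task i
# runs on q processors; t[i][q]: execution time of task i on q
# processors; p: total number of processors.
def cost(schedule, t):
    # cost of a coschedule: sum over packs of the longest task in the pack
    return sum(max(t[i][q] for i, q in pack) for pack in schedule)

def relative_cost(schedule, t, p):
    # reference: 1-pack schedule with every task on all p processors
    one_pack_cost = sum(t[i][p] for i in range(len(t)))
    return cost(schedule, t) / one_pack_cost

def packing_ratio(schedule, t, p):
    # total work divided by p times the coschedule cost; close to 1
    # means almost no idle time
    work = sum(q * t[i][q] for pack in schedule for i, q in pack)
    return work / (p * cost(schedule, t))
```

For instance, two identical tasks sharing the processors of a single pack yield a packing ratio of 1 when their speedup profiles are perfect.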
An individual user could be concerned about an increase in response time and a corresponding degradation of individual productivity. To assess the impact on response time, we consider the performance with respect to a relative response time measure, defined as follows. We consider a 1pack schedule with the tasks sorted in nondecreasing order of execution time, i.e., in "shortest task first" order, to yield a minimal value of the response time. If this ordering is given by the permutation π, the response time of task T_{π(i)} is r(π(i)) = Σ_{j≤i} t_{π(j),p}, and the minimal mean response time is R_min = (1/n) Σ_i r(π(i)). For a given coschedule with packs scheduled in nondecreasing order of their costs, the response time of a task T_i in the jth pack, assigned σ(i) processors, is r(i) = t_{i,σ(i)} + Σ_{l<j} C_l, where C_l is the cost of the lth pack, for 1 ≤ l < j. The mean response time R of the coschedule is calculated using these r(i) values, and we use R/R_min as the relative response time.
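The response-time computation can be sketched as follows; the conventions are ours: a schedule is a list of packs already sorted by nondecreasing cost, each pack is a list of (task, processors) pairs, and t[i][q] is a hypothetical execution-time table.

```python
# Response-time measures (our own naming and conventions).
def mean_response_time(schedule, t):
    # schedule: packs sorted by nondecreasing cost; a task in pack j
    # finishes after the first j-1 packs plus its own execution time
    elapsed, total, n = 0.0, 0.0, 0
    for pack in schedule:
        pack_cost = max(t[i][q] for i, q in pack)
        for i, q in pack:
            total += elapsed + t[i][q]
            n += 1
        elapsed += pack_cost
    return total / n

def min_mean_response_time(t, p):
    # shortest-task-first 1-pack schedule, each task on all p processors
    elapsed, total = 0.0, 0.0
    times = sorted(t[i][p] for i in range(len(t)))
    for x in times:
        elapsed += x
        total += elapsed
    return total / len(times)
```

The relative response time is then the ratio of the first quantity to the second for a given workload.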
Results for small and medium workloads– For Wo