On the Optimality of Scheduling Dependent MapReduce Tasks on Heterogeneous Machines
Abstract
MapReduce is one of the most popular big-data computation frameworks, motivating many research topics. A MapReduce job consists of two successive phases, i.e., a map phase and a reduce phase. Each phase can be divided into multiple tasks. A reduce task can only start when all the map tasks finish processing. A job is successfully completed when all its map and reduce tasks are complete. Optimally scheduling the different tasks on different servers to minimize the weighted completion time is an open problem, and is the focus of this paper. In this paper, we give an approximation algorithm with a competitive ratio , where is the number of servers and is the task-skewness product. We implement the proposed algorithm on the Hadoop framework, and compare it with three baseline schedulers. Results show that our DMRS algorithm can outperform baseline schedulers by up to .
I Introduction
Big Data has emerged in the past few years as a new paradigm presenting abundant opportunities and challenges, including efficient processing and computation of data. This has led to the development of parallel computing frameworks, such as MapReduce [1], designed to process massive amounts of data. MapReduce has two fundamental processes, map and reduce. Input data are first split into smaller segments that are processed by parallel map tasks on different machines. The intermediate output, consisting of key-value pairs, is then processed by reduce tasks to obtain the final result. Due to the increasing level of heterogeneity in both application requirements and computing infrastructures, scheduling algorithms for the MapReduce framework have been widely studied with the goal of reducing job completion times [2, 3, 4, 5, 6].
A key challenge in designing optimal MapReduce schedulers is the dependence between map and reduce tasks [7, 8]. More specifically, a reduce task can only start when all map tasks associated with the same job are completed, leading to precedence constraints between the map and reduce tasks of each job. In this paper, we consider the problem of optimally scheduling dependent map and reduce tasks on heterogeneous machines to minimize the weighted completion time, which is motivated by the existence of different job priorities and requirements. The problem has been studied since the 1950s [9], but remains open despite recent progress on approximation algorithms developed for a few special cases [10, 11, 12, 13, 14, 15].
When each job has a single task, thus giving no precedence constraints for dependent map and reduce tasks, the scheduling problem has been studied as the Unrelated Machine Scheduling problem, where tasks can have arbitrary processing times on different machines. Unrelated Machine Scheduling is known to be strongly NP-hard even in the single-machine setting, and is APX-hard even when all jobs are available to schedule at time 0 [15] (referred to as zero release time). Different approximation algorithms have been proposed for the problem in [12, 16, 17, 15, 13, 18, 14]. This problem is listed in [17] as one of the top ten open problems in the field of approximate scheduling algorithms, and the best known results are a 1.5-approximation algorithm for zero release time [14] and a 1.8786-approximation algorithm for arbitrary arrival times [15]. In a separate line of work, under the strong assumption that machines are identical and have the same processing time for each task, performance bounds for scheduling dependent map and reduce tasks are derived in [10, 11]. Related work also includes [19, 20, 21, 22, 23]. However, rather than focusing directly on the weighted job completion time, a different problem of minimizing the total completion time of a sequence of jobs is considered in [19, 20, 21], while the approach in [22, 23] assumes speedup of the machines.
In this paper, we consider the general problem of minimizing weighted completion time under precedence constraints (for dependent map and reduce tasks) and on heterogeneous machines (with different processing speeds). We develop an approximation algorithm when all jobs are released at time 0, where is the number of machines and is a metric quantifying the task-skewness product of map and reduce tasks, defined as the sum processing size of all map and reduce tasks divided by that of the largest map and reduce tasks. Thus, we have , and the value of increases with an increased number of tasks, and is also higher for a left-skewed task-size distribution (i.e., a larger percentage of large-size jobs). The competitive ratio becomes for general job release times. The key idea of our approach is to schedule map and reduce tasks by solving an approximated linear program, in which the dependence between map and reduce tasks is cast into (precedence) constraints with respect to map/reduce completion times. Then, we show that the proposed linear program can be solved in polynomial time. It yields a scheduling algorithm with a provable competitive ratio compared to the optimal scheduling solution. For dependent map and reduce task scheduling on heterogeneous machines with arbitrary processing speeds and under precedence constraints, our result advances the state of the art (a 37.87-approximation algorithm proposed in [24] and a 54-approximation algorithm proposed in [25], both of which hold only for zero release times) and achieves nearly optimal performance when the jobs to be processed contain a large number of tasks, as this leads to a higher value of the task-skewness product and thus tighter competitive ratios for both zero and arbitrary release times.
The proposed scheduler is implemented in Hadoop. A key feature of the implementation is its ability to adapt to system execution dynamics and uncertainty. In particular, we implement a task scheduler that not only computes a schedule of map and reduce tasks according to the proposed solution, but also has the ability to re-optimize the schedule on the fly based on available task progress and renewed completion time estimates. We also modify the Application Master and Resource Manager in Hadoop to ensure task/job execution in the desired order, as well as to handle potential desynchronization and disconnection issues. Our extensive experiments, using a combination of WordCount, Sort, and TeraSort benchmarks on a heterogeneous cluster, validate that our proposed DMRS outperforms FIFO, Identical-machine, and Map-only in terms of total weighted completion time. In particular, DMRS achieves the smallest total weighted completion time for benchmarks with heavy workloads in the reduce phase, e.g., TeraSort.
The main contributions of the paper are as follows.

We consider the optimization of weighted completion time by scheduling dependent map and reduce tasks under precedence constraints and on heterogeneous machines, and propose an approximation algorithm for the problem.

The proposed scheduling algorithm is based on the solution of an approximated linear program, which recasts the precedence constraints and is shown to be solvable in polynomial time.

We analyze the proposed scheduling algorithm and quantify its approximation ratio with both zero and arbitrary release times, which significantly improves prior art, especially when the number of tasks per job is large.

We implement the proposed algorithm on the Hadoop framework, and thoroughly compare it with other schedulers such as FIFO, Identical-machine, and Map-only. Results show that DMRS outperforms FIFO, Identical-machine, and Map-only, in terms of total weighted completion time, by up to , , and , respectively.
The rest of the paper is organized as follows. We present the system model and formulate the problem in Section II. The approximation algorithm is provided in Section III, and its analysis in Section IV. The implementation details of DMRS are presented in Section V. Section VI provides the experimental evaluation of DMRS, comparing it with other schedulers.
II System Model and Problem Formulation
Consider a computing system consisting of heterogeneous, parallel (physical or virtual) machines, and each machine has speed , . Without loss of generality, we assume that are sorted in descending order, i.e., . A set of jobs are submitted to the system, and the release time (i.e., the earliest time a job can be processed) of job is . Each job contains a set of map tasks (denoted as ) and a set of reduce tasks (denoted as ). For each job , each map task is required to process data of size , and each reduce task to process data of size . Without loss of generality, we assume and are decreasing in . Under our model, a task to process data size takes time to complete when running on machine that has a speed . Different tasks of the same job may be processed concurrently on different machines. Due to the precedence constraints, the reduce tasks of each job can start only after all its map tasks are completed.
We introduce some notations that are employed in this paper to simplify the analysis and discussions. Let denote the total processing rates of machines, i.e., . Further, let denote the maximum number of concurrent map and reduce tasks, i.e., , which can be scheduled in parallel. We define to be the sum processing speed of the fastest machines, or . It is easy to see that is the maximum possible processing speed of job , since its tasks can only occupy distinct machines at any given time. We denote the total processing data size of all map tasks of job as , i.e., , and the total processing data size of all reduce tasks of job as , i.e., . Let be the sum of and . Finally, we use to denote the completion time of all map tasks, and the completion time of all reduce tasks of job , which is also the completion time of job since map tasks complete before reduce tasks.
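To make the notation above concrete, the following Python sketch computes the processing time of a task on a machine and the aggregate speed of the fastest machines available to one job. The function names are hypothetical, and reading the per-job parallelism bound as the larger of the map-task count and the reduce-task count is an assumption made for illustration.

```python
from typing import Sequence


def task_time(size: float, speed: float) -> float:
    """Time to process `size` units of data on a machine of the given speed."""
    return size / speed


def top_speed(speeds: Sequence[float], k: int) -> float:
    """Sum of the k largest machine speeds (k is capped by the machine count)."""
    ranked = sorted(speeds, reverse=True)
    return sum(ranked[:k])


def max_job_speed(speeds: Sequence[float], n_map: int, n_reduce: int) -> float:
    """Upper bound on the aggregate rate of one job.

    Assumption: a job can occupy at most max(n_map, n_reduce) distinct
    machines at any time, since its map and reduce phases do not overlap.
    """
    return top_speed(speeds, max(n_map, n_reduce))
```

For example, with speeds `[4, 2, 1]` a job with two map tasks and one reduce task can be served at an aggregate rate of at most `4 + 2 = 6`.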
In this paper, our goal is to find an algorithm that schedules different tasks on heterogeneous parallel machines, so as to minimize the weighted completion time of the jobs, i.e., , where is a nonnegative weight reflecting the priority of job . This problem is strongly NP-hard, even when each job has a single task (i.e., ), since it becomes an unrelated machine scheduling problem [15]. In this paper, we provide an approximation algorithm with a provable competitive ratio for the proposed problem.
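The objective can be illustrated with a short Python sketch that evaluates the total weighted completion time of a schedule. The single-machine helper is a hypothetical toy used only to show how the job order changes the objective; it is not part of the scheduling algorithm developed in this paper.

```python
from typing import List, Sequence


def total_weighted_completion_time(weights: Sequence[float],
                                   completion_times: Sequence[float]) -> float:
    """Sum over jobs of weight times completion time."""
    if len(weights) != len(completion_times):
        raise ValueError("one weight per job is required")
    return sum(w * c for w, c in zip(weights, completion_times))


def single_machine_completion_times(sizes: Sequence[float],
                                    order: Sequence[int],
                                    speed: float = 1.0) -> List[float]:
    """Toy helper: run whole jobs back to back on one machine in `order`."""
    clock = 0.0
    done = [0.0] * len(sizes)
    for j in order:
        clock += sizes[j] / speed
        done[j] = clock
    return done


sizes, weights = [4.0, 1.0], [1.0, 5.0]
# Running the long, low-weight job first delays the short, high-weight job.
long_first = single_machine_completion_times(sizes, [0, 1])
short_first = single_machine_completion_times(sizes, [1, 0])
```

In this toy instance, running the short high-weight job first reduces the objective from 29 to 10, which is why job priorities matter when ordering jobs.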
III Approximation Algorithm
In this section, we develop an algorithm to solve the weighted completion time minimization problem on heterogeneous machines and under precedence constraints. The algorithm is based on first solving a linear optimization, referred to as the LP-MapReduce problem. The solution is then used to obtain a feasible schedule executing map and reduce tasks on the machines.
III-A LP-MapReduce
We formulate the LP-MapReduce as follows:
(1) 
subject to:
(2)  
(3)  
(4)  
(5)  
(6) 
We note that constraint (2) is based on Queyranne's constraint set [28], which has been used to give a 2-approximation for concurrent open shop scheduling [29, 30] without precedence constraints. The extension to machines with different processing speeds is due to [18], which formulated different versions of the polyhedral constraints based on Queyranne's constraint set. Our constraint (2) is similar to that in [18], but is now applied to the reduce completion times, because all map tasks must be completed before reduce tasks under precedence constraints, and thus, is also the completion time of the entire job . While (2) does not explicitly force map tasks to finish before reduce tasks, it states that the completion of a job implies finishing all its map and reduce tasks.
Constraint (3) means that the completion time of all map tasks is at least the release time (i.e., the earliest time job can begin processing) plus the required processing time of any map task on the fastest machine (i.e., the minimum required processing time of any map task). Similarly, constraint (4) requires the completion time of all reduce tasks to be at least the completion time of all map tasks plus the time needed to finish any reduce task on the fastest machine. This is due to the precedence constraint, which forces reduce tasks to start after all map tasks are finished. Finally, constraint (5) implies that the time required to process all map tasks (i.e., from to ) is at least the total data size of all the map tasks divided by the maximum possible processing speed of job . Similarly, (6) means that the reduce task completion time must be at least the completion time of all map tasks plus the minimum processing time of all reduce tasks, given by the total data size of all reduce tasks divided by the maximum possible processing speed of job . We note that constraints (3)-(6) do not account for multiple jobs, thus providing loose bounds. Furthermore, constraints (3)-(6) can be made concise by reducing them to a single combined constraint:
(7)  
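The per-job bounds described in words above can be combined into a single lower bound on a job's completion time. The Python sketch below follows that verbal description: release time, plus the larger of the two map-phase bounds, plus the larger of the two reduce-phase bounds. The function name and argument layout are hypothetical, and the per-job machine limit is assumed to be the larger of the map-task and reduce-task counts.

```python
from typing import Sequence


def job_completion_lower_bound(release: float,
                               map_sizes: Sequence[float],
                               reduce_sizes: Sequence[float],
                               speeds: Sequence[float]) -> float:
    """Lower bound on a job's completion time, ignoring all other jobs."""
    ranked = sorted(speeds, reverse=True)
    fastest = ranked[0]
    # Assumption: the job uses at most max(#map, #reduce) machines at once.
    k = max(len(map_sizes), len(reduce_sizes))
    job_speed = sum(ranked[:k])

    # Map phase: the largest task on the fastest machine, or the total
    # map work spread over the fastest machines the job can occupy.
    map_bound = max(max(map_sizes, default=0.0) / fastest,
                    sum(map_sizes) / job_speed)
    # Reduce phase: same two bounds, starting after the map phase ends.
    reduce_bound = max(max(reduce_sizes, default=0.0) / fastest,
                       sum(reduce_sizes) / job_speed)
    return release + map_bound + reduce_bound


bound = job_completion_lower_bound(2.0, [4.0, 4.0], [6.0], [2.0, 1.0])
```

Here the map phase is limited by total work (8 units at aggregate speed 3), while the reduce phase is limited by its single task on the fastest machine (6 units at speed 2).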
Because constraints (2) and (7) are necessary for any feasible solution of the weighted completion time optimization, any optimal solution of the LP-MapReduce provides a lower bound for the weighted completion time optimization. This lower bound may not be tight, and the optimal solution may not be feasible in the original optimization, since LP-MapReduce does not take into account all sufficient constraints. Nevertheless, we show that a feasible schedule for executing map and reduce tasks on different machines can be obtained from the optimal solution of the LP-MapReduce.
We note that in the proof of the approximation ratio, we only use the constraints (2) and (which follows from (7)). These constraints do not consider the precedence constraints and only account for all the map and reduce tasks being completed by the servers. Thus, even though the LP formulation with these two constraints does not account for the precedence constraints, the proposed algorithm will be shown to have approximation guarantees in the case when the precedence constraints are present.
III-B Complexity of Solving LP-MapReduce
At first glance, even though LP-MapReduce is a linear program, the constraint (2) takes every possible subset of and thus contains different constraints, one for each . We show that, by utilizing a special structure of these linear constraints, the LP-MapReduce problem can actually be solved in polynomial time. To this end, we make use of the Ellipsoid method [31], which needs a separation oracle to determine the violated constraint. In order to find such a separation oracle, we will first determine the most violated constraint in (2) since there are only other constraints in (7). Let the violation for a set be defined as follows.
(8) 
Let for all be a potentially feasible solution to LP-MapReduce. Let denote the ordering when jobs are sorted in increasing order of . Find the most violated constraint in (2) by searching over for of the form , . If the maximal violation among these sets is positive, then return the corresponding set as a violated constraint for (2). Otherwise, check the remaining constraints (7) directly in linear time.
OracleLP finds the subset of jobs that maximizes the “violation”. That is, OracleLP finds such that . We prove the correctness of OracleLP by establishing a necessary and sufficient condition for a job to be in .
Lemma 1.
Let . Then, we have
Proof.
The proof follows on the same lines as in [18] and is thus omitted. ∎
Given Lemma 1, it is easy to verify that sorting jobs in increasing order of to define a permutation guarantees that is of the form for some . This implies that OracleLP finds in time. Since the remaining constraint (7) can be verified in linear time, OracleLP runs in time. Thus, the LP-MapReduce problem is solvable in polynomial time by the Ellipsoid method [31] using the above separation oracle.
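The prefix search behind the separation oracle can be sketched in Python as follows. This is a minimal sketch under a stated assumption: the subset constraint is taken to have the Queyranne form scaled by the total machine speed, i.e., for every job set S, the sum of `T[j] * C[j]` over S is at least `((sum of T[j])**2 + sum of T[j]**2) / (2 * mu)`. The exact constraint used in this paper may differ, and the names `T`, `C`, `mu`, and `most_violated_prefix` are hypothetical.

```python
from typing import List, Sequence, Tuple


def violation(subset: Sequence[int], T: Sequence[float],
              C: Sequence[float], mu: float) -> float:
    """Right-hand side minus left-hand side of the assumed constraint."""
    total = sum(T[j] for j in subset)
    squares = sum(T[j] ** 2 for j in subset)
    return (total ** 2 + squares) / (2.0 * mu) - sum(T[j] * C[j] for j in subset)


def most_violated_prefix(T: Sequence[float], C: Sequence[float],
                         mu: float) -> Tuple[float, List[int]]:
    """Scan prefixes of the jobs sorted by candidate completion time.

    Returns (largest violation found, job set attaining it). A violation
    of 0.0 with an empty set means no prefix violates the constraint.
    """
    order = sorted(range(len(C)), key=lambda j: C[j])
    best_val, best_set = 0.0, []
    total = squares = lhs = 0.0
    for i, j in enumerate(order):
        # Running sums make each prefix evaluation O(1) after the sort.
        total += T[j]
        squares += T[j] ** 2
        lhs += T[j] * C[j]
        val = (total ** 2 + squares) / (2.0 * mu) - lhs
        if val > best_val:
            best_val, best_set = val, order[: i + 1]
    return best_val, best_set


value, jobs_in_set = most_violated_prefix([3.0, 1.0, 2.0], [1.0, 0.5, 4.0], 2.0)
```

Only as many prefixes as there are jobs are evaluated, instead of every subset; under the assumed constraint form, adding a job to a set helps exactly when its completion time is below a threshold, which is why the most violated set is a prefix of the sorted order.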
III-C Proposed Algorithm
Let be the optimal completion times found by solving the LP-MapReduce problem. Our algorithm to identify a feasible schedule for the weighted completion time optimization consists of the following steps. First, we sort the jobs, with respect to , in ascending order. The sorted jobs are labeled , , where is the corresponding permutation of the jobs. Next, we schedule map and reduce tasks one by one, beginning with map tasks from the one with the highest processing data size to the lowest (i.e., for ), and then scheduling reduce tasks in the same order. When scheduling each (map or reduce) task , we assign it to a machine that produces the earliest completion time, with respect to all tasks already assigned to the machine, the task 's release time, and the required processing time of task on the machine. Finally, once all (map and reduce) tasks are assigned, the order of task executions on each machine is determined. We then insert idle time on all machines as necessary, if any job's reduce tasks have an earlier starting time than the completion of all its map tasks, or if any job begins processing ahead of its release time. This procedure enforces the precedence constraints between map and reduce tasks, as well as the feasibility of job starting times. The pseudocode of our proposed Dependent MapReduce Scheduling (DMRS) algorithm is shown in Algorithm 1.
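The steps above can be sketched in Python as a greedy list scheduler. This is a simplified variant and not a transcription of Algorithm 1: instead of assigning all tasks first and inserting idle time afterwards, it folds the release-time and precedence delays into the earliest-finish computation for each task. The `Job` class, the function names, and the input `lp_completion` (standing in for the completion times returned by the linear program) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Job:
    release: float
    map_sizes: List[float]
    reduce_sizes: List[float]


def _earliest_finish(size: float, ready: float, free_at: Sequence[float],
                     speeds: Sequence[float]) -> Tuple[int, float]:
    """Machine (lowest index on ties) that finishes the task earliest."""
    finish = [max(f, ready) + size / v for f, v in zip(free_at, speeds)]
    i = min(range(len(speeds)), key=finish.__getitem__)
    return i, finish[i]


def greedy_schedule(jobs: Sequence[Job], lp_completion: Sequence[float],
                    speeds: Sequence[float]) -> Dict[int, float]:
    """Return the completion time of every job under the greedy rule."""
    order = sorted(range(len(jobs)), key=lambda j: lp_completion[j])
    free_at = [0.0] * len(speeds)
    done: Dict[int, float] = {}
    for j in order:
        job = jobs[j]
        map_done = job.release
        # Map tasks, largest first; none may start before the release time.
        for size in sorted(job.map_sizes, reverse=True):
            i, t = _earliest_finish(size, job.release, free_at, speeds)
            free_at[i] = t
            map_done = max(map_done, t)
        job_done = map_done
        # Reduce tasks, largest first; none may start before all maps end.
        for size in sorted(job.reduce_sizes, reverse=True):
            i, t = _earliest_finish(size, map_done, free_at, speeds)
            free_at[i] = t
            job_done = max(job_done, t)
        done[j] = job_done
    return done


example_jobs = [Job(0.0, [4.0, 2.0], [2.0]), Job(0.0, [2.0], [2.0])]
completion = greedy_schedule(example_jobs, [1.0, 2.0], [2.0, 1.0])
```

In this two-machine instance the first job finishes at time 3 and the second at time 5; the reduce task of each job waits for that job's last map task even when a machine is free earlier.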
IV Proof of DMRS Approximation Ratio
We analyze the approximation ratio of the DMRS algorithm. Let be the completion time of job from the proposed DMRS algorithm.
Lemma 2.
Suppose jobs are scheduled using DMRS. The completion time of job , for all , satisfies
(9)  
Proof.
For now, assume all jobs are released at time zero. Let be the reduce task of job that finishes last. Further, let denote the total demand for machine and idle time involved once all the tasks of through except the task are scheduled. Then, we have
(10) 
This is because is the overall load of the jobs. However, there is an additional idle time between the map tasks and the reduce tasks for all the jobs for that needs to be accounted for in the overall load. The idle time after the map phase of job is less than the time for processing a task of data size at the server. If not, there is a server that is idle for a time that is larger than that required to process after all map tasks of job are completed. Then, the last completed map task of could be shifted to this server, decreasing the overall map completion time and thus contradicting the policy under which each task is assigned to the machine that completes it as early as possible. Further, the machine that finishes the map task the latest would not have idle time, and thus the total idle time at machines is each at most that to process , making this extra load summed over all the machines at most .
The completion time of the job depends on the last completed reduce task , and thus, we have (The job assigned on any server will give an upper bound on the completion time of the job). Summing this equation over all , we get
(11) 
Now suppose that some . We take our policy to the extreme and suppose that all machines are left idle until every one of jobs through are released. Note that this occurs precisely at time . It is clear that beyond this point in time, we are effectively in the case where all jobs are released at time zero, hence we can bound the remaining time to completion by the above expression and thus we have
(12) 
Since, , we get the result as in the statement of the Lemma. ∎
Let be the largest number s.t. for all jobs . We note that the above trivially holds for , and thus . We now use the above Lemma to give an approximation analysis for the proposed DMRS algorithm. A larger possible value of helps achieve a tighter bound. We call the task-skewness product, since it can be viewed as the number of tasks times the mean task size divided by the sum of the task sizes of the largest map task and the largest reduce task. Thus, for the same skewness, a larger number of tasks increases , and a higher share of large jobs increases the mean for the same maximum, which raises the skewness and thus the value of . Let be the completion times given by the LP-MapReduce solution, and be the completion times of the optimal solution. Let be the objective value for the optimal schedule. Since any feasible solution satisfies the constraints of LP-MapReduce, we have . Then, we have the following approximation result.
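As a worked illustration of the task-skewness product, the Python sketch below computes, for each job, the total size of all its tasks divided by the sum of its largest map task and its largest reduce task, and then takes the smallest ratio over all jobs (the largest value for which the inequality holds for every job). The function name and the representation of a job as a pair of size lists are hypothetical.

```python
from typing import List, Sequence, Tuple

JobSizes = Tuple[List[float], List[float]]  # (map task sizes, reduce task sizes)


def task_skewness_product(jobs: Sequence[JobSizes]) -> float:
    """Smallest per-job ratio of total task size to the largest map plus
    largest reduce task size. Every job needs at least one map task."""
    ratios = []
    for map_sizes, reduce_sizes in jobs:
        total = sum(map_sizes) + sum(reduce_sizes)
        largest = max(map_sizes) + max(reduce_sizes, default=0.0)
        ratios.append(total / largest)
    return min(ratios)


many_equal_tasks = [([4.0, 4.0, 4.0, 4.0], [2.0, 2.0])]   # ratio 20 / 6
one_dominant_task = [([8.0, 1.0], [3.0])]                 # ratio 12 / 11
single_task_each = [([5.0], [7.0])]                       # ratio 1
```

A job with one map task and one reduce task gives a ratio of exactly 1, the smallest possible value, while jobs with many comparably sized tasks give larger values and hence tighter bounds.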
Theorem 1.
If , then . Otherwise .
Proof.
Since , it is enough to show that .
From (2), we have
(13) 
Taking , we have
(14)  
(15) 
This further reduces to
(16) 
We first consider the case when . Using Lemma 2, we have
This proves the result for for all . If that is not the case, we note from (5) and (6) that . Thus, .
Thus, we have for every ,
(19)  
Thus, we get the approximation with an additional gap of , giving the result as in the statement of the Theorem. ∎
Theorem 1 shows that the proposed DMRS algorithm achieves an objective value with a competitive ratio . We note that is a metric quantifying the task-skewness product of map and reduce tasks, defined as the sum processing data sizes of all map and reduce tasks divided by that of the largest map and reduce tasks. Since , the competitive ratio is at most , while tighter bounds are obtained for larger , i.e., as the number of tasks per job increases and there is a higher percentage of large-size jobs (i.e., left-skewed).
V Implementation
We implement our proposed DMRS scheduler in Hadoop. It consists of three key modules: a job scheduler that solves the LP-MapReduce problem to determine the scheduling order of different jobs, a task scheduler that is responsible for scheduling map and reduce tasks on different machines, and an execution database that stores statistics of previously executed jobs/tasks for estimating task completion times. A key feature of our implementation is its ability to adapt to system execution dynamics and uncertainty. In particular, DMRS's task scheduler not only computes a schedule of map and reduce tasks according to Algorithm 1, but also has the ability to re-optimize the schedule on the fly based on available task progress and renewed completion time estimates. We also modify the Application Master (AM) and Resource Manager (RM) in Hadoop, which work in collaboration with the task scheduler to ensure the execution of tasks in the desired order.
By default, Hadoop consists of three base modules, i.e., the Hadoop distributed file system (HDFS), YARN, and MapReduce. YARN is responsible for managing computing resources, and the RM is its core component for allocating resource containers to running applications. Each MapReduce application has a dedicated AM instance. It is responsible for negotiating resources with the RM, and assigning containers to tasks. Heartbeat messages are sent continuously from the AM to the RM during an execution to update application states and container demands.
Our DMRS scheduler works as follows. First, the job scheduler loads the necessary job parameters and queries the execution database for estimated machine speeds , to formulate and solve the LP-MapReduce problem. The resulting job order is input to the task scheduler to find the schedule and placement of every map and reduce task according to Algorithm 1. Next, based on the task schedule and placement, the RM assigns a queue to each machine to store all map and reduce tasks that are scheduled to run on it. Tasks in each machine queue are then processed in a FIFO manner, guaranteeing the execution of jobs/tasks under our proposed algorithm. In particular, each task is given a unique ID. When resources become available on machine , the RM launches a container and associates it with the head-of-line task. The container and task-ID pair are sent to the AM for launching the desired task.
To adapt our DMRS scheduler to system execution dynamics and uncertainty, the task scheduler continuously monitors task/job progress through AMs, refines the estimate of completion time, and if necessary, re-optimizes task schedules on the fly. More precisely, before launching each (map or reduce) task , the task scheduler estimates the completion time of all jobs/tasks scheduled before on each machine . The time is obtained by combining known task completion times (which are available from the execution database) and estimating the remaining times of active tasks (which are calculated by each AM using the remaining data size divided by machine speed). Then, the optimization in Algorithm 1 is repeated at runtime to find the optimal machine for task . If is a map task, we have
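The per-machine queueing described above can be mimicked with plain Python data structures. The sketch below is a stand-in for illustration only: it does not use or reproduce any Hadoop or YARN interface, and the function names and task-ID strings are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Optional, Tuple


def build_machine_queues(
        placement: Iterable[Tuple[str, int]]) -> Dict[int, Deque[str]]:
    """Group task IDs by machine, preserving the scheduled order."""
    queues: Dict[int, Deque[str]] = defaultdict(deque)
    for task_id, machine in placement:
        queues[machine].append(task_id)
    return queues


def launch_next(queues: Dict[int, Deque[str]], machine: int) -> Optional[str]:
    """Pop the head-of-line task when `machine` has free resources.

    Returns None when nothing is queued for that machine.
    """
    queue = queues.get(machine)
    return queue.popleft() if queue else None


# (task ID, machine index) pairs in the order produced by the task scheduler.
plan = [("job1-map0", 0), ("job1-map1", 1), ("job1-reduce0", 0)]
queues = build_machine_queues(plan)
first = launch_next(queues, 0)
second = launch_next(queues, 0)
third = launch_next(queues, 0)
```

Because each queue is served strictly in FIFO order, the execution order on every machine matches the order fixed by the schedule, regardless of when containers become available.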
(20) 
where is the release time of job containing map task . If is a reduce task, we need to take the precedence constraint into account, i.e.,
(21) 
where is the estimated completion time of job ’s map tasks. A new optimization of all remaining tasks by the task scheduler is triggered if is different from the previous solution. This makes our DMRS schedule robust to any possible execution uncertainty and estimation errors.
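The runtime machine selection can be sketched in Python as follows. The sketch assumes that the rule picks the machine minimizing the later of the machine's estimated availability time and the task's ready time, plus the task size divided by the machine speed; the ready time is the job's release time for a map task and the estimated map completion time of the job for a reduce task. The function and parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def pick_machine(size: float, ready_time: float,
                 busy_until: Sequence[float],
                 speeds: Sequence[float]) -> Tuple[int, float]:
    """Return (machine index, estimated finish time) of the best machine.

    busy_until[i] is the estimated time at which machine i has finished
    everything scheduled ahead of this task.
    """
    finish = [max(b, ready_time) + size / v
              for b, v in zip(busy_until, speeds)]
    best = min(range(len(speeds)), key=finish.__getitem__)
    return best, finish[best]


def needs_replan(planned_machine: int, size: float, ready_time: float,
                 busy_until: Sequence[float],
                 speeds: Sequence[float]) -> bool:
    """True when fresh estimates favor a machine other than the planned one."""
    best, _ = pick_machine(size, ready_time, busy_until, speeds)
    return best != planned_machine


speeds = [2.0, 1.0]
busy_until = [5.0, 0.0]
map_choice = pick_machine(2.0, 0.0, busy_until, speeds)      # map task, released at 0
reduce_choice = pick_machine(2.0, 6.0, busy_until, speeds)   # reduce task, maps end at 6
```

In this example the map task goes to the idle slow machine, whereas the reduce task, which cannot start before time 6 anyway, is better placed on the fast machine; a change of this kind relative to the earlier plan is what triggers a new optimization of the remaining tasks.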
We also implement additional features in both AM and RM to make them fault tolerant. The container and taskID pairs are duplicated at each AM in advance (after an optimal schedule is computed by the job and task schedulers). If RM accidentally sends an incorrect container that is intended for application (e.g., due to lack of synchronization), AM will detect such inconsistency and immediately release the container back to RM. Further, a mechanism to handle occasional disconnection is implemented in both AM and RM, allowing them to buffer current containers/tasks and attempt reconnection.
VI Evaluation
In this section, we evaluate DMRS on a Hadoop cluster with three benchmarks, viz., WordCount, Sort, and TeraSort. We compare DMRS with the FIFO, Identical-machine, and Map-only schedulers. FIFO is provided by Hadoop, and schedules jobs based on their release order. Within a job, FIFO schedules a task to the first available machine, and may schedule reduce tasks before all map tasks are scheduled when it estimates that there are enough cluster resources for scheduling all map tasks. Identical-machine assumes all machines are identical, and applies Algorithm 1 to schedule jobs and tasks. Map-only assumes that the map phase is the most critical, and employs Algorithm 1 to schedule jobs and tasks without considering the reduce phase.
VI-A Experimental Setup
We set up a heterogeneous cluster. The cluster contains (virtual) machines, and each machine consists of a physical core and 8 GB of memory. Each machine can process one task at a time. In the cluster, machines are connected to a Gigabit Ethernet switch and the link bandwidth is 1 Gbps. The heterogeneous cluster contains two types of machines, fast machines and slow machines. The processing speed ratio between a fast machine and a slow machine is . We evaluate DMRS by using three benchmarks: WordCount, Sort, and TeraSort. WordCount is a CPU-bound application, and Sort is an I/O-bound application. TeraSort is CPU-bound in the map phase and I/O-bound in the reduce phase. We download the workload for WordCount from Wikipedia, and generate workloads for Sort and TeraSort by using the RandomWriter and TeraGen applications provided by default Hadoop. The number of reduce tasks per job is set based on the workload of the reduce phase. We set the number of reduce tasks per job in WordCount to be 1, and in Sort and TeraSort to be 4. All jobs are associated with weights, and the values of the weights are uniformly distributed between 1 and 5. Also, all jobs are partitioned into two release groups, and each group contains the same number of jobs. The release time interval between the two groups is sec. The completion time of a job is measured in hours.
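The weight and release-time assignment described above can be reproduced with a short Python sketch. The function name, the use of continuous (rather than integer) uniform weights, the fixed random seed, and the `release_gap` argument are assumptions for illustration; the gap of 60.0 used in the example is a placeholder and not a value taken from the experiments.

```python
import random
from typing import List, Tuple


def make_workload(n_jobs: int, release_gap: float,
                  seed: int = 0) -> List[Tuple[float, float]]:
    """Return one (weight, release_time) pair per job.

    Weights are drawn uniformly from [1, 5]. The first half of the jobs
    is released at time 0 and the second half `release_gap` later.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    half = n_jobs // 2
    workload = []
    for j in range(n_jobs):
        weight = rng.uniform(1.0, 5.0)
        release = 0.0 if j < half else release_gap
        workload.append((weight, release))
    return workload


workload = make_workload(20, release_gap=60.0)
```

Seeding the generator keeps the weights identical across the schedulers being compared, so differences in total weighted completion time come from the scheduling policy alone.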
VI-B Experimental Results
In the first set of experiments, each experiment contains 20 jobs, and the workload of a job is 1GB. The task sizes of all jobs are the same, and equal 64MB. Figure 2 shows that DMRS outperforms FIFO, Identical-machine, and Map-only by up to , , and , respectively. Identical-machine has the largest total weighted completion time (TWCT). The reason is that it distributes the same number of tasks to each machine. The job's completion time is dominated by the completion time of the tasks running on slow machines. Also, Identical-machine results in a large amount of cluster resources being wasted, since fast machines need to wait for slow machines to finish their tasks. FIFO schedules jobs based on the jobs' release order. Jobs with high weights (time-sensitive jobs) cannot be scheduled first, so time-sensitive jobs cannot be completed in time. For task scheduling, FIFO does not consider the heterogeneous cluster environment, and tasks are scheduled to the first available container. Such a task scheduling scheme can increase the completion time of tasks, since a container which becomes available later might be launched on a fast machine and be able to complete a task faster. Also, FIFO might schedule reduce tasks soon after the job is scheduled, and before the last map task is scheduled. Even though such a scheme leaves more time for reduce tasks to fetch data from map tasks' outputs, given the large network bandwidth available nowadays, reduce tasks only need a little time for fetching data from all map tasks' outputs. Map-only schedules jobs and tasks without considering the reduce phase. Under benchmarks with light workloads in the reduce phase, e.g., under the WordCount benchmark, Map-only can achieve performance comparable to DMRS.
However, if the workload of the reduce phase is comparable with that of the map phase, e.g., under Sort, scheduling jobs without considering the workload of reduce phases and scheduling reduce tasks to random machines result in performance degradation and an increase in TWCT. Furthermore, under a benchmark with heavy workload in the reduce phase, e.g., TeraSort, TWCT is dominated by the completion time of reduce tasks, and Map-only increases TWCT by , compared with DMRS.
Figure 2 shows the results of evaluating DMRS, in terms of TWCT, by employing jobs with different task sizes and different amounts of workload. In the set of experiments whose results are shown in Figure 2, each experiment contains 20 jobs. Of these 20 jobs, 12 jobs need to process 1GB of data each, with a task size of 64MB; 4 jobs need to process 0.5GB each, with a task size of 32MB; and the remaining 4 jobs need to process 2GB each, with a task size of 128MB. Figure 2 shows that DMRS outperforms FIFO, Identical-machine, and Map-only by up to , , and , respectively. Under WordCount and TeraSort, the TWCT of FIFO is much larger than the TWCT of the other schedulers in Figure 2 and the TWCT of FIFO in Figure 2. This is because several jobs schedule reduce tasks soon after the jobs are scheduled; those reduce tasks occupy all fast machines but cannot start to process data until all map tasks complete, so all map tasks are scheduled on slow machines. Even though the job completion times under FIFO have large variation, based on our results, DMRS can outperform FIFO by at least . Introducing jobs with large workloads (large jobs) also increases the TWCT of FIFO, since it does not consider jobs' and tasks' workloads in job scheduling. Jobs with small workloads (small jobs) might be scheduled after large jobs, and this makes small jobs suffer from the starvation problem.
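The effect of ignoring machine speeds can be seen in a small worked example in Python. It compares an even split of equal-size tasks across machines (which is how the discussion above characterizes Identical-machine) with a speed-aware rule that always places the next task on the machine that would finish it earliest. The function names and the two-machine instance are hypothetical and are not taken from the experiments.

```python
from typing import Sequence


def makespan_equal_split(n_tasks: int, size: float,
                         speeds: Sequence[float]) -> float:
    """Give every machine the same number of tasks, ignoring its speed."""
    if n_tasks % len(speeds) != 0:
        raise ValueError("this toy example assumes an even split")
    per_machine = n_tasks // len(speeds)
    return max(per_machine * size / v for v in speeds)


def makespan_earliest_finish(n_tasks: int, size: float,
                             speeds: Sequence[float]) -> float:
    """Place each task on the machine that would finish it earliest."""
    free_at = [0.0] * len(speeds)
    for _ in range(n_tasks):
        i = min(range(len(speeds)),
                key=lambda m: free_at[m] + size / speeds[m])
        free_at[i] += size / speeds[i]
    return max(free_at)


even = makespan_equal_split(12, 1.0, [2.0, 1.0])
aware = makespan_earliest_finish(12, 1.0, [2.0, 1.0])
```

With one machine twice as fast as the other, the even split finishes at time 6 because the slow machine holds half of the tasks, while the speed-aware rule finishes at time 4 by giving the fast machine eight of the twelve tasks.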
We further increase the number of large jobs and small jobs, and set the number of jobs in each experiment to be 18 to make the total workload of jobs in each experiment be roughly the same as the first two sets of experiments’. Of the 18 jobs, 6 jobs need to process 1GB data each, with a task size of 64MB; another 6 jobs need to process 0.5GB each, with a task size of 32MB; and the remaining 6 jobs need to process 2GB each, with a task size of 128MB. Figure 2 shows that as the number of large jobs increases, DMRS has low TWCT, since small jobs with large weights do not suffer from the starvation problem.
In the final experiment, we fix the number of TeraSort jobs, i.e., 18 jobs, and change the number of elephant jobs. We set the task size to be 64MB. An elephant job needs to process 2GB data, and a mice job needs to process 0.5GB data. Figure 3 shows that DMRS outperforms FIFO, Identicalmachine, and Maponly by up to , , and , respectively. As the number of elephant jobs increases, mice jobs with large weights might be scheduled after more elephant jobs, and this results in long waiting times for mice jobs, causing large increase in TWCT. Also, as the number of elephant jobs increases, the total workload increases, and the long time occupied on fast machines by reduce tasks before all map tasks finish increases TWCT greatly, since more map tasks have to process data on slow machines. By comparing TWCT of Identicalmachine, Maponly, and DMRS, we observe that as the number of elephant jobs increases, the increments of Identicalmachine and Maponly are much larger than DMRS’s. For Identicalmachine, increasing the number of elephant jobs means the difference of the amount of time used to complete all assigned tasks on fast machines and on slow machines increases. Without considering the scheduling of reduce tasks, as the number of elephant jobs increases, more workloads of reduce tasks are assigned to slow machines, and this results in a large increase in TWCT.
VII Conclusions
This paper considers the scheduling of MapReduce jobs on machines with different speeds. The precedence constraint between the map tasks and the reduce tasks of each MapReduce job is captured to give a scheduling algorithm that aims to minimize the weighted completion time of all jobs. The problem is NP-hard, and the proposed solution schedules the tasks on the servers based on the solution of a linear program that can be solved in polynomial time. The proposed approach is shown to be approximately optimal, with a competitive ratio of , where is the number of servers and is the task-skewness product. The competitive ratio is shown to be when all the jobs are released at time . The algorithm is implemented on the Hadoop framework and compared with other schedulers. The results demonstrate a significant improvement of the proposed algorithm over the baseline schedulers.
References
 [1] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, pp. 107–113, Jan. 2008.
 [2] Q. Chen, D. Zhang, M. Guo, Q. Deng, and S. Guo, “Samr: A self-adaptive mapreduce scheduling algorithm in heterogeneous environment,” in IEEE CIT, 2010, pp. 2736–2743.
 [3] Z. Tang, J. Zhou, K. Li, and R. Li, “A mapreduce task scheduling algorithm for deadline constraints,” Cluster computing, vol. 16, no. 4, pp. 651–662, 2013.
 [4] L. Thomas and R. Syama, “Survey on mapreduce scheduling algorithms,” International Journal of Computer Applications, vol. 95, no. 23, 2014.
 [5] Z. Huang, B. Balasubramanian, M. Wang, T. Lan, M. Chiang, and D. H. K. Tsang, “Need for speed: CORA scheduler for optimizing completion-times in the cloud,” in IEEE INFOCOM, 2015, pp. 891–899.
 [6] W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang, “Maptask scheduling in mapreduce with data locality: Throughput and heavy-traffic optimality,” IEEE/ACM Trans. Netw., vol. 24, no. 1, pp. 190–203, Feb. 2016.
 [7] J. Tan, X. Meng, and L. Zhang, “Coupling task progress for mapreduce resource-aware scheduling,” in IEEE INFOCOM, 2013, pp. 1618–1626.
 [8] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Job scheduling for multi-user mapreduce clusters,” University of California, Berkeley, Tech. Rep., 2009.
 [9] W. E. Smith, “Various optimizers for single-stage production,” Naval Research Logistics (NRL), vol. 3, no. 1–2, pp. 59–66, 1956.
 [10] F. Chen, M. S. Kodialam, and T. V. Lakshman, “Joint scheduling of processing and shuffle phases in mapreduce systems,” in IEEE INFOCOM, 2012, pp. 1143–1151.
 [11] Y. Yuan, D. Wang, and J. Liu, “Joint scheduling of mapreduce jobs with servers: Performance bounds and experiments,” in IEEE INFOCOM, 2014, pp. 2175–2183.
 [12] H. Chang, M. S. Kodialam, R. R. Kompella, T. V. Lakshman, M. Lee, and S. Mukherjee, “Scheduling in mapreduce-like systems for fast completion time,” in IEEE INFOCOM, 2011, pp. 3074–3082.
 [13] M. Sviridenko and A. Wiese, “Approximating the configuration-lp for minimizing weighted sum of completion times on unrelated machines,” in International Conference on Integer Programming and Combinatorial Optimization. Springer, 2013, pp. 387–398.
 [14] N. Bansal, A. Srinivasan, and O. Svensson, “Lift-and-round to improve weighted completion time on unrelated machines,” in Proceedings of the forty-eighth annual ACM symposium on Theory of Computing. ACM, 2016, pp. 156–167.
 [15] S. Im and S. Li, “Better unrelated machine scheduling for weighted completion time via random offsets from non-uniform distributions,” in IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, 2016, pp. 138–147.
 [16] M. Skutella, “Convex quadratic and semidefinite programming relaxations in scheduling,” Journal of the ACM (JACM), vol. 48, no. 2, pp. 206–242, 2001.
 [17] P. Schuurman and G. J. Woeginger, “Polynomial time approximation algorithms for machine scheduling: Ten open problems,” Journal of Scheduling, vol. 2, no. 5, pp. 203–213, 1999.
 [18] R. Murray, S. Khuller, and M. Chao, “Scheduling distributed clusters of parallel machines: Primal-dual and lp-based approximation algorithms [full version],” arXiv preprint arXiv:1610.09058, 2016.
 [19] Y. Zheng, N. B. Shroff, and P. Sinha, “A new analytical technique for designing provably efficient mapreduce schedulers,” in IEEE INFOCOM, 2013, pp. 1600–1608.
 [20] S. Im, M. Naghshnejad, and M. Singhal, “Scheduling jobs with non-uniform demands on multiple servers without interruption,” in IEEE INFOCOM, 2016, pp. 1–9.
 [21] Y. Zhu, Y. Jiang, W. Wu, L. Ding, A. Teredesai, D. Li, and W. Lee, “Minimizing makespan and total completion time in mapreduce-like systems,” in IEEE INFOCOM, 2014, pp. 2166–2174.
 [22] B. Moseley, A. Dasgupta, R. Kumar, and T. Sarlós, “On scheduling in mapreduce and flowshops,” in ACM SPAA. New York, NY, USA: ACM, 2011, pp. 289–298.
 [23] M. Lin, L. Zhang, A. Wierman, and J. Tan, “Joint optimization of overlapping phases in mapreduce,” SIGMETRICS Perform. Eval. Rev., vol. 41, no. 3, pp. 16–18, Jan. 2014.
 [24] D. Fotakis, I. Milis, O. Papadigenopoulos, V. Vassalos, and G. Zois, “Scheduling mapreduce jobs under multi-round precedences,” in European Conference on Parallel Processing. Springer, 2016, pp. 209–222.
 [25] D. Fotakis, I. Milis, O. Papadigenopoulos, E. Zampetakis, and G. Zois, “Scheduling mapreduce jobs and data shuffle on unrelated processors,” in International Symposium on Experimental Algorithms. Springer, 2015, pp. 137–150.
 [26] Z. Qiu, C. Stein, and Y. Zhong, “Minimizing the total weighted completion time of coflows in datacenter networks,” in ACM SPAA, 2015.
 [27] S. Luo, H. Yu, Y. Zhao, S. Wang, S. Yu, and L. Li, “Towards practical and near-optimal coflow scheduling for data center networks,” IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 11, pp. 3366–3380, 2016.
 [28] M. Queyranne, “Structure of a simple scheduling polyhedron,” Mathematical Programming, vol. 58, no. 1, pp. 263–285, 1993.
 [29] J. Y.-T. Leung, H. Li, and M. Pinedo, “Scheduling orders for multiple product types to minimize total weighted completion time,” Discrete Applied Mathematics, vol. 155, no. 8, pp. 945–970, 2007.
 [30] M. Mastrolilli, M. Queyranne, A. S. Schulz, O. Svensson, and N. A. Uhan, “Minimizing the sum of weighted completion times in a concurrent open shop,” Operations Research Letters, vol. 38, no. 5, pp. 390–395, 2010.
 [31] M. Grötschel, L. Lovász, and A. Schrijver, “The ellipsoid method and its consequences in combinatorial optimization,” Combinatorica, vol. 1, no. 2, pp. 169–197, 1981.