Bandwidth-Aware Scheduling with SDN in Hadoop: A New Trend for Big Data
Software Defined Networking (SDN) is a revolutionary network architecture that separates network control functions from the underlying equipment and is an increasingly popular way to help enterprises build more manageable data centers, where big data processing emerges as an important class of applications. To process large-scale data concurrently, MapReduce, with an open-source implementation named Hadoop, was proposed. In practical Hadoop systems, one issue that vitally impacts overall performance is known as the NP-complete minimum makespan problem (the makespan is the time between a job's start time and its finish time). One main solution is to assign tasks to data-local nodes to avoid link occupation, since network bandwidth is a scarce resource. Many methodologies for enhancing data locality have been proposed, such as the Hadoop default scheduler (HDS) and the state-of-the-art scheduler BAR. However, all of them either ignore allocating tasks in a global view or disregard the available bandwidth as the basis for scheduling. In this paper we propose BASS, a heuristic bandwidth-aware task scheduler that combines Hadoop with SDN. It is not only able to guarantee data locality in a global view but can also assign tasks efficiently in an optimized way. Both examples and experiments demonstrate that BASS achieves the best performance in terms of job completion time. To our knowledge, BASS is the first scheduler to exploit the capabilities of SDN for big data processing, and we believe it points out a new trend for large-scale data processing.
Bandwidth-Aware Scheduling, Software Defined Networking, Hadoop, Big Data
Software Defined Networking (SDN, see https://www.opennetworking.org/) is a revolutionary network architecture that separates network control functions from the underlying equipment and deploys them centrally on a controller, with OpenFlow as the standard interface. With SDN, applications can treat the network as a logical entity, which gives enterprises and carriers unprecedented programmability, automation and network control. In addition, SDN provides a set of APIs to simplify the implementation of common network services such as routing, multicast, security, access control, bandwidth management, quality of service (QoS) and storage optimization.
As a result, SDN creates numerous opportunities to help enterprises build more deterministic, innovative, manageable and highly scalable data centers that extend beyond private enterprise networks to public IT resources, making OpenFlow-based SDN data centers a new development trend.
At the same time, big data processing emerges as an important class of applications in such data centers, which not only handle but also generate large amounts of data every day. To process large-scale data concurrently with high efficiency, MapReduce, with an open-source implementation named Hadoop (http://lucene.apache.org/), was proposed. It is an increasingly common computing system used by Yahoo!, Amazon, Facebook and others. The logical view of a Hadoop system is shown in the upper left of Fig. 1, while the physical view is shown at the bottom.
In practical Hadoop systems, one issue that vitally impacts overall performance is known as the NP-complete minimum makespan problem, which seeks to minimize the job completion time. Since network bandwidth is a scarce resource, assigning tasks to data-local nodes is important for avoiding link occupation and thereby shortening the makespan.
Many methodologies for enhancing data locality have been proposed, such as the Hadoop Default Scheduler (HDS) and the state-of-the-art BAlance-Reduce scheduler (BAR). However, all of them either ignore allocating tasks in a global view or disregard the available bandwidth as the basis for scheduling, which forfeits optimization opportunities for task assignment.
As the new trend of big data processing evolves with SDN, a natural question arises for the minimum makespan issue: can we combine the bandwidth control capability of SDN with the Hadoop system to obtain an optimized task scheduling solution with high efficiency and agility in terms of job completion time for big data processing? (shown by the question mark on top of Fig. 1)
In this paper we propose BASS (Bandwidth-Aware Scheduling with SDN in Hadoop), a bandwidth-aware task scheduler that combines Hadoop with SDN. It first utilizes SDN to manage bandwidth and allocates it in a Time Slot (TS) manner; BASS then decides whether to assign a task locally or remotely depending on its completion time. It is not only able to guarantee data locality in a global view but can also assign tasks efficiently in an optimized way. Most importantly, BASS fully accounts for the scarce network bandwidth reported by the OpenFlow controller and treats it as a vital parameter for task scheduling. To our knowledge, BASS is the first scheduler to exploit the capabilities of SDN for big data processing in Hadoop. Both examples and real-world experiments demonstrate that BASS outperforms all previous related algorithms, including BAR, which represents the state of the art.
The main contributions are summarized as follows.
We formalize the makespan problem and develop a TS scheme for bandwidth allocation.
We exploit the capability of SDN and propose a bandwidth-aware task scheduler BASS which outperforms all previous related algorithms.
I-B Paper Structure
The remainder of the paper is structured as follows. In Section II we review related work. In Section III we formalize the scheduling problem in a Hadoop cluster. In Section IV we propose the SDN-based bandwidth-aware scheduler BASS and present detailed examples for illustration. In Section V we describe the experiments. Section VI concludes the paper and discusses future work.
II Related Work
A broad class of prior literature, ranging from big data processing to the newly emerging SDN, is related to our work.
The Hadoop default scheduler greedily searches for data-local tasks and assigns them to idle nodes, which, however, results in an increased job completion time. Zaharia et al. propose delay scheduling to address the conflict between data locality and fairness; however, the introduced delays may lead to under-utilization and instability. Tan et al. find that for current schedulers, map tasks and reduce tasks are not jointly optimized, which may cause job starvation and unfavorable data locality; to mitigate this problem they propose a coupling scheduler that combines map and reduce tasks. Since Hadoop assumes that all cluster nodes are dedicated to a single user, it fails to guarantee high performance in shared environments. To address this issue, Seo et al. propose a prefetching and pre-shuffling scheme; however, the bandwidth occupied by transferring data blocks is not significantly reduced.
Jin et al. propose the BAR scheduler to globally reduce the job completion time; it is the work most closely related to BASS. It builds on prior work by Fischer et al. on assigning tasks efficiently in Hadoop, which investigates task assignment and gives an idealized Hadoop model to evaluate the cost of task assignments. They show that task assignment is an NP-complete problem for which only near-optimal solutions can be found, at high computational cost. To address this issue, BAR first produces an initial task allocation and then gradually reduces the job completion time by tuning that allocation. However, in some cases, as Discussion 1 in Section IV shows, BAR cannot efficiently reduce the job completion time, while BASS can reduce it from 39s to 35s.
Independently, SDN, originating from the Clean Slate program at Stanford University, is a new network architecture that separates network control functions from the underlying equipment; the underlying switches perform only simple data forwarding. The leading SDN technology is based on the OpenFlow protocol, a standard designed for SDN that is already being deployed in a variety of networks and networking products. The most important feature of OpenFlow is its capability for network monitoring and traffic control, which offers an alternative way to speed up Hadoop big data processing.
III Problem Formalization
With the network control capability provisioned by SDN, we can capture real-time network status such as traffic and bandwidth. We define the following notation.

Let $t$ denote a task within a Hadoop job; $n$ a node in the Hadoop cluster; $D(t)$ the size of the input split data for $t$ when it is assigned to $n$; $T_{move}(t,n)$ the time to move the input data of $t$ from its data source to $n$ ($T_{move}=0$ when $n$ stores a replica); $T_{comp}(t)$ the task computation time; $T_{exec}(t,n)$ the task execution time; $T_{idle}(n)$ the time when $n$ becomes idle; $T_{ct}(t,n)$ the completion time of $t$ on $n$; $B(n_s,n)$ the bandwidth between the source node $n_s$ and $n$; and $B_a$ the real-time available bandwidth of a link. Based on the above symbols we obtain Eq.(1) to Eq.(3):

$$T_{move}(t,n) = D(t) / B(n_s,n) \quad (1)$$
$$T_{exec}(t,n) = T_{move}(t,n) + T_{comp}(t) \quad (2)$$
$$T_{ct}(t,n) = T_{idle}(n) + T_{exec}(t,n) \quad (3)$$

For a map or reduce task $t$, the Objective Function (Eq.(4)) is to find an available node that yields the earliest completion time among all nodes of the cluster:

$$\min_{1 \le j \le k} T_{ct}(t, n_j) \quad (4)$$

From the global view of a job, however, the Objective Function (Eq.(5)) is slightly different: we need the slowest map or reduce task to finish as early as possible, i.e., we minimize the completion time of the whole job:

$$\min \max_{1 \le i \le m} T_{ct}(t_i, n(t_i)) \quad (5)$$

where $m$ is the number of tasks of the job and $k$ is the number of nodes in the Hadoop cluster.
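To make Eq.(1) to Eq.(4) concrete, the per-task objective can be sketched in a few lines of Python. The node/task records and field names here are hypothetical illustrations, not part of the paper:

```python
def movement_time(split_mb, bandwidth_mbps, is_local):
    """Eq.(1): data movement time; zero when the node already holds a replica."""
    return 0.0 if is_local else (split_mb * 8) / bandwidth_mbps

def completion_time(idle_s, comp_s, split_mb, bandwidth_mbps, is_local):
    """Eq.(2)-(3): node idle time plus movement time plus computation time."""
    return idle_s + movement_time(split_mb, bandwidth_mbps, is_local) + comp_s

def best_node(task, nodes):
    """Eq.(4): pick the node with the earliest completion time for this task."""
    return min(nodes, key=lambda n: completion_time(
        n["idle"], task["comp"], task["split_mb"], n["bw_mbps"],
        n["name"] in task["replicas"]))

# Hypothetical instance: a 64MB split, 9s computation, two candidate nodes.
task = {"comp": 9.0, "split_mb": 64, "replicas": {"n1"}}
nodes = [{"name": "n1", "idle": 10.0, "bw_mbps": 100},   # data-local but busy
         {"name": "n2", "idle": 3.0, "bw_mbps": 100}]    # remote but idle early
print(best_node(task, nodes)["name"])  # n2: 3 + 5.12 + 9 = 17.12s beats 10 + 9 = 19s
```

The example already hints at the core tradeoff BASS exploits: a remote node with spare bandwidth can beat a heavily loaded data-local node.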
IV SDN-Based Bandwidth-Aware Scheduling in Hadoop for Big Data Processing
IV-A Time Slot Bandwidth Allocation
Benefiting from the capability of SDN to obtain the real-time link bandwidth $B_a$, we propose a scheme to allocate bandwidth in a Time Slot manner. The main principle is as follows.
Before Hadoop task scheduling begins, the occupation time of each link's residue bandwidth is divided into equal time slots (TSs), whose duration is a tunable parameter chosen according to the practical network scenario. We use $B_a(TS_i)$ to denote the residue bandwidth of a certain link at time slot $TS_i$. If a task $t$ requires data movement through a certain path during some interval, the scheduler assigns the corresponding time slots to it in advance, guaranteeing that the bandwidth of all links on this path, from the starting slot to the ending slot, is reserved for $t$.
The motivation for the TS scheme is as follows. Bandwidth is a scarce resource in practical Hadoop clusters, especially when nodes compete intensely for it. Thus, to utilize the available bandwidth fully, we argue that always granting tasks that require data movement the most residue bandwidth, and reclaiming it after the occupation ends, is a simple but effective solution in practice. Both Example 1 and real-world experiments demonstrate its validity.
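The TS reservation idea can be sketched as a small per-link slot table. The data structures, link names and slot horizon below are illustrative assumptions, not the paper's implementation:

```python
class SlotTable:
    """Per-link table of time slots; each slot holds the residue bandwidth (Mbps)."""
    def __init__(self, links, horizon_slots, capacity_mbps):
        self.table = {l: [capacity_mbps] * horizon_slots for l in links}

    def residue(self, path, start, end):
        # Residue bandwidth of a path in each slot = minimum over its links.
        return [min(self.table[l][s] for l in path) for s in range(start, end)]

    def reserve(self, path, start, end, mbps):
        """Reserve mbps on every link of the path for slots [start, end)."""
        if any(r < mbps for r in self.residue(path, start, end)):
            return False            # some slot lacks enough residue bandwidth
        for l in path:
            for s in range(start, end):
                self.table[l][s] -= mbps
        return True

ts = SlotTable(["link1", "link2"], horizon_slots=40, capacity_mbps=100)
ts.reserve(["link1", "link2"], 3, 8, 100)   # one 64MB block at 100Mbps ~ 5 slots
print(ts.reserve(["link1"], 5, 6, 50))      # overlapping slot is exhausted -> False
```

Reclaiming bandwidth "after the occupation" falls out naturally: a reservation only touches its own slots, so later slots remain at full capacity.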
IV-B BASS: Bandwidth-Aware Scheduling with SDN in Hadoop
The BASS algorithm is illustrated as follows. A submitted job has $m$ tasks and there are $k$ available nodes in a Hadoop cluster. Note that the number of available nodes may be less than the total number of nodes in the cluster, especially when the Hadoop system is shared by users.
Case 1: A data-local node $n_{local}$ is found
For a task $t$, since bandwidth is a scarce resource in the Hadoop system, we prefer to assign it to a data-local node $n_{local}$ if there is one. When $n_{local}$ is found, its available idle time $T_{idle}(n_{local})$ is recorded. However, $n_{local}$ is not always the optimal option, especially when there is already too much workload on $n_{local}$, which forces $t$ to wait a nontrivial extra time and results in a much longer job completion time. In this scenario we take the data movement time into account and treat it as a significant parameter for job scheduling. We then search for the node $n_{min}$ whose available idle time is minimum at the current time; $T_{idle}(n_{min})$ is also recorded for further analysis.
Case 1.1: The data-local node $n_{local}$ is optimal
If the data-local node $n_{local}$ is just the node $n_{min}$, or its available idle time is no greater than $T_{idle}(n_{min})$, we assign $t$ to $n_{local}$ directly, since there is no cost of data movement according to Eq.(1).
If $n_{local}$ is found but its available idle time is greater than $T_{idle}(n_{min})$, there is a tradeoff on whether to assign $t$ to node $n_{min}$, depending on the residue bandwidth (or residue time slots) of the links on the path from the data source to $n_{min}$. In this case, to make sure that the remote node $n_{min}$ is a better choice than the data-local node for running task $t$, we first calculate the task completion times $T_{ct}(t, n_{local})$ and $T_{ct}(t, n_{min})$ using Eq.(1) to Eq.(3).
Subject to the objective that $T_{ct}(t, n_{min})$ is smaller than $T_{ct}(t, n_{local})$, we obtain the corresponding bandwidth $B_{need}$ required for moving the input data to the remote node $n_{min}$. We then compare $B_{need}$ with the real-time available bandwidth $B_a$.
Case 1.2: The remote node $n_{min}$ is optimal
If $B_{need} \le B_a$, the available bandwidth is enough for transferring the input data of $t$ with a task completion time earlier than $T_{ct}(t, n_{local})$. In this case we assign $t$ to the remote node $n_{min}$ and reserve time slots on the path from the data source to $n_{min}$ via the SDN controller. Note that the TSs on a link that are allocated to task $t$ are determined by the residue TSs of the path it belongs to, which equal the minimum residue TSs over all of the path's links.
Case 1.3: $n_{min}$ is not optimal due to limited bandwidth
If $B_{need} > B_a$, the available bandwidth is not enough to move the input data of $t$ with a task completion time earlier than $T_{ct}(t, n_{local})$. In this case, since the total cost on $n_{min}$, including the data transfer expense, is greater than that of the data-local node $n_{local}$, there is no need to run the task remotely. Therefore, we assign $t$ to $n_{local}$.
Case 2: No data-local node is found (locality starvation)
All the above cases assume that $n_{local}$ can be found and is available. However, a Hadoop cluster may be shared by different users, each of whom is only authorized to use a subset of the nodes. The input split data thus has a high probability of not being stored on any of these nodes, so $n_{local}$ may not be found; we call this scenario locality starvation in this paper. To deal with it, Algorithm 1 adopts a solution similar to the case where the remote node is optimal.
In this case we assign $t$ to the remote node $n_{min}$ and reserve time slots on the path from the data source to $n_{min}$ via the SDN controller. The TSs on a link that are allocated to task $t$ are again determined by the residue TSs of the path it belongs to, which equal the minimum residue TSs over all of the path's links.
All the tasks are scheduled via the above process until an allocation result is obtained. BASS is an optimized scheduling scheme compared with the HDS and BAR schedulers for the following three reasons.
Firstly, BASS maintains the priority of data locality with no cost of transferring data among nodes.
Secondly, BASS utilizes SDN's bandwidth management capability to assign link bandwidth to a remote node $n_{min}$, making sure that the task completion time is earlier than that of the data-local node $n_{local}$.
Last but not least, BASS utilizes the TS scheme to allocate link bandwidth in the temporal dimension, which is a simple but effective solution in practice.
To explain this algorithm clearly we give the following example.
As shown in Fig. 2, an OpenFlow-controlled Hadoop cluster is composed of 4 task nodes ($n_1$ to $n_4$), an OpenFlow controller and a master node/scheduler. There are 2 OpenFlow switches, a router and 8 links (Link 1 to Link 8) connecting all the nodes. A job with 9 tasks is scheduled by the master node. Each input split has 2 replicas located on 2 different nodes.
Assuming that the size of each data block is 64MB and each link's bandwidth is 100Mbps, if the available bandwidth percentage is 100% then the data movement time calculated by Eq.(1) is 5.12s; here we round it to 5s for simplicity. We set each time slot to 1s in this paper, so one data block movement occupies 5 time slots. Since the nodes are homogeneous, the task computation time is the same on each node, and we use 9s for illustration in this example.
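The 5.12s figure follows directly from Eq.(1), remembering that 64MB is 512 megabits:

```python
block_mb = 64                        # data block size in MB
link_mbps = 100                      # link bandwidth in Mbps, 100% available
move_s = block_mb * 8 / link_mbps    # Eq.(1): 512 megabits / 100 Mbps
print(move_s)                        # 5.12 seconds, rounded to 5s (five 1s slots)
```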
After scheduling starts, BASS allocates the tasks sequentially as Algorithm 1 describes, and Fig. 3(a) shows the allocation result. Take the first task for instance: BASS first obtains the available idle time (initial load) of each of the 4 servers.
For this task, BASS first finds a node with data locality and records its available idle time; it then finds another node whose available idle time is minimum at the current time. Since the available idle time of the data-local node is greater than that of the remote node, it uses Eq.(1), Eq.(2) and Eq.(3) to calculate the task's completion time on each of the two nodes. By checking that the residue bandwidth of 100Mbps on Link 1 and Link 2 is enough for moving data block 1, it confirms that the completion time on the remote node is less than that on the data-local node. Therefore, it allocates the task to the remote node. BASS allocates the other tasks in the same way. The allocation result is shown in Fig. 3(a), where the job completion time is 35s, since the last-finishing task determines the completion time of the whole job.
In this case BASS transfers the input split data for the first task over Link 1, so its residue bandwidth, 100% of 100Mbps from 3s to 8s, is allocated for the data movement; the occupied time slots are the five slots spanning 3s to 8s. The occupation of time slots on Link 2 is the same.
Note that we may also choose the other replica as the data source for this task. In that case Link 1, Link 7, Link 8 and Link 3 need to allocate time slots for the data movement, where the occupation is the same as before.
To see the efficiency of the BASS scheduler, we use the HDS and BAR schedulers, the latter representing the state of the art in this domain, to assign the same 9 tasks for comparison, again assuming that the data movement time is 5s and the task computation time is 9s.
HDS always chooses a data-local task for an idle node and assigns it there; if no data-local task is available, HDS chooses a task randomly. Take one task for instance: replicas of its input split data are stored on two nodes, and after comparing their available idle times, HDS allocates the task to the node that becomes idle earlier. The remaining tasks are assigned similarly. Note that when a node becomes available at 25s, only a non-local task is left, so that node has to carry it out. The resulting assignment is shown in Fig. 3(b): the job completion time, determined by the slowest task, is 39s, which is 4s later than that of the BASS scheduler.
The BAR scheduler is based on HDS and further improves job performance by globally adjusting data locality according to network state and cluster workload. In the first phase, BAR allocates tasks to nodes obeying the data locality principle, with the same result shown in Fig. 3(b). In the second phase, BAR searches for the task whose completion time is the latest and checks whether there is a remote node that can finish it earlier; if such a node is found, BAR reassigns the task to it and repeats this process until no such node remains. Using the same parameters, consider the second phase of BAR: given the data-local assignment from the first phase, BAR identifies the latest-finishing task (shown in Fig. 3(c)) and checks whether another node, given its available idle time plus the data movement and computation costs, can finish that task earlier. Since the remote completion time is smaller than 39s, BAR moves the task there, and the job completion time becomes 38s, as shown in Fig. 3(d).
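BAR's second phase can be sketched as a simple rebalancing loop over per-node task queues. This is an illustrative toy model (uniform task and movement times, no link contention), not the authors' code:

```python
def bar_rebalance(assign, comp_s, move_s):
    """assign: {node: [task, ...]} from the data-local first phase.
    Repeatedly move the latest-finishing task to the node that would finish
    it earliest, paying the data-movement cost move_s for each moved task."""
    remote = set()                                 # tasks moved off their data
    def finish(node):                              # finish time of a node's queue
        return sum(comp_s + (move_s if t in remote else 0) for t in assign[node])
    while True:
        worst = max(assign, key=finish)            # node finishing latest
        if not assign[worst]:
            break
        task = assign[worst][-1]
        target = min(assign, key=lambda n: finish(n) + move_s + comp_s)
        if finish(target) + move_s + comp_s >= finish(worst):
            break                                  # no remote node improves it
        assign[worst].pop()
        assign[target].append(task)
        remote.add(task)
    return max(finish(n) for n in assign)

# Toy instance: 9s tasks, 5s movement; node A overloaded with 3 tasks, B has 1.
print(bar_rebalance({"A": ["t1", "t2", "t3"], "B": ["t4"]}, comp_s=9, move_s=5))
```

In the toy instance the makespan drops from 27s to 23s after one move, mirroring the 39s-to-38s improvement in the running example. Note that, unlike BASS, the loop never asks whether the link actually has bandwidth free at the right time.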
From this case we can clearly see that the BASS scheduler outperforms BAR and HDS in overall performance in terms of job completion time. A comparison of HDS, BAR and BASS is summarized in Fig. 4.
Can we further reduce the job completion time?
Seo et al. propose a prefetching scheme to improve overall performance in shared environments while retaining compatibility with the native Hadoop scheduler. Inspired by this idea, we propose a similar prefetching method, called Pre-BASS, to further reduce the job completion time.
The main process is as follows. First, the optimized Pre-BASS scheduler allocates tasks, guaranteeing that each one is optimal in terms of its completion time. Then Pre-BASS checks each data-remote task and lets its input split data be prefetched/transferred before the node's available idle time, as early as possible, depending on the real-time residue bandwidth (or residue time slots). Note that when prefetching a data block, it is always moved from the least-loaded node storing a replica, to minimize the impact on overall performance. We give another example to illustrate the Pre-BASS scheme.
Consider the first task of Example 1, which BASS runs on a remote node (as Fig. 3(a) shows). The data movement starts at 3s and occupies five time slots. If we utilize the prefetching scheme here to let this task prefetch its input data starting at 0s, occupying the five slots from 0s to 5s, then the completion time of all tasks on this node is reduced from 35s to 32s. The last-finished task of the job then changes, and the job completion time drops from 35s to 34s, a further performance improvement in the global view (see the right side of Fig. 4 for the performance comparison).
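The Pre-BASS idea amounts to shifting a remote task's reserved movement slots earlier whenever the path has residue bandwidth. A minimal sketch, with illustrative slot bookkeeping rather than the paper's implementation:

```python
def prefetch_start(free_slots, move_slots, latest_start):
    """Earliest start s <= latest_start such that slots [s, s + move_slots)
    are all free on the path; falls back to latest_start otherwise."""
    for s in range(latest_start + 1):
        if all((s + k) in free_slots for k in range(move_slots)):
            return s
    return latest_start

# Example 1: the movement needs 5 slots and would start at slot 3 (3s..8s);
# if slots 0..7 are free on the path, prefetching can begin at 0s instead.
free = set(range(0, 8))
print(prefetch_start(free, move_slots=5, latest_start=3))  # 0
```

Starting the transfer 3 slots earlier is exactly what shrinks the node's finish time from 35s to 32s in the example.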
How can we make the utmost of SDN?
One important feature of SDN/OpenFlow is its simple QoS scheme based on a queuing mechanism. Recall the huge volume of shuffle traffic, which consumes large amounts of bandwidth. If Hadoop traffic, especially shuffle traffic, is given higher priority using the QoS capability of OpenFlow, we believe the job completion time of the Hadoop system can be further reduced. We give an example to illustrate this.
Take the topology of the SDN-controlled Hadoop cluster in Fig. 2 for instance. We first set the maximum rate of both OpenFlow switches to 150Mbps and set up three queues with 100Mbps, 40Mbps and 10Mbps, respectively. New flow entries are then added to direct the shuffle traffic to the 100Mbps queue, which has the higher link bandwidth, and to direct background traffic to the 10Mbps queue to limit its impact on Hadoop tasks; the rest of the traffic occupies the 40Mbps queue. This simple scheme outperforms the default of putting all traffic in the same queue with a maximum rate of 150Mbps.
V Experiments for Performance Evaluation
In this section we present real-world experiments to investigate the effectiveness of BASS. For comparison, the two most closely related prior schedulers, HDS and BAR, described in Section IV, are also implemented.
V-A Experiment Setup
The Hadoop cluster with OpenFlow switches used in this experiment is shown in Fig. 2. The cluster, consisting of 6 nodes located in 5 physical systems, runs Hadoop version 1.2.1 and is connected by 2 Open vSwitch (OVS, http://openvswitch.org/) instances.
The number of block replicas is set to 3, the data block size is 64MB, and the maximum link rate is set to 100Mbps, a tunable parameter in practice.
We utilize the ProgressRate scheme, which works well in practice, to estimate the initial workload and the available idle time of each node. The progress rate of each task is calculated as ProgressRate = ProgressScore / T, where ProgressScore represents the task progress (a value between 0 and 1) and T is the amount of time the task has been running. The time to complete is then estimated as (1 − ProgressScore) / ProgressRate.
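In code, the estimate is a direct transcription of the two formulas:

```python
def time_to_complete(progress_score, elapsed_s):
    """ProgressRate = ProgressScore / elapsed; remaining = (1 - score) / rate."""
    rate = progress_score / elapsed_s     # progress per second so far
    return (1.0 - progress_score) / rate  # estimated seconds remaining

# A task half done after 16s is estimated to need another 16s.
print(time_to_complete(0.5, 16.0))  # 16.0
```

The estimate assumes a constant progress rate, which is the same simplification the scheme relies on in practice.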
We choose both Wordcount and Sort jobs as our test cases (Wordcount consumes more CPU, while Sort occupies more disk I/O resources) and run them with input data sizes of 150MB, 300MB, 600MB, 1GB and 5GB, respectively. We repeatedly execute a background job to provide each test with an initial workload. Each test is run 20 times and we use the average value for comparison.
V-B Experiment Results
From Table II(a) we see that for the Wordcount job, the BASS scheduler always finishes with the minimum map- and reduce-phase completion time, i.e., the minimum makespan, compared with BAR and HDS. In some cases the data locality ratio (LR) of BASS is low: taking the input data of 600MB for example, the LR of BASS is 58.3% while the LRs of BAR and HDS are 66.7% and 83.3%. However, the makespan of BASS is only 231s, compared with 259s for BAR and 269s for HDS.
The reason is that in this scenario bandwidth is sufficient for transferring data while computation resources on the data-local nodes are scarce, so running locally is no longer a good choice. In this light, we argue that link bandwidth and data locality should be considered jointly to achieve the best overall performance in practice.
The Sort job in Table II(b) also demonstrates the validity of the BASS scheduler in a real-world system.
A clear comparison of the above three schedulers for both the Wordcount and Sort jobs is shown in Fig. 5, where BASS outperforms the other two in terms of job completion time.
VI Conclusions and Expectations
In this paper, we exploit SDN and take full account of link bandwidth to improve the performance of big data processing. We first formalize the makespan problem in Hadoop and develop a Time Slot (TS) scheme for bandwidth allocation. We then propose the SDN-based bandwidth-aware scheduler BASS, which can flexibly assign tasks in an optimized way. Last but not least, we provide examples and conduct extensive real-world experiments to demonstrate the effectiveness of BASS. To our knowledge, BASS is the first scheduler to exploit the capabilities of SDN for big data processing in a Hadoop cluster.
As SDN evolves together with big data processing, for future work we plan to implement BASS in enterprise data centers composed of practical SDN products such as OpenFlow switches, and to evaluate BASS's scalability in a much larger cluster. Furthermore, we believe that BASS points out a new trend for large-scale data processing.
-  T. White, Hadoop: The Definitive Guide (Third Edition). O’Reilly Media Incorporation, May 2012.
-  J. Jin, J. Luo, A. Song, F. Dong, and R. Xiong, “Bar: An efficient data locality driven task scheduling algorithm for cloud computing,” in Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) 2011, Newport Beach, California, USA, May 2011, pp. 295–304.
-  S. Azodolmolky, P. Wieder, and R. Yahyapour, “Cloud computing networking: challenges and opportunities for innovations,” IEEE Communications Magazine, vol. 51, no. 7, pp. 54–62, July 2013.
-  M. J. Fischer, X. Su, and Y. Yin, “Assigning tasks for efficiency in hadoop: Extended abstract,” in Proceedings of the 22nd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2010), Thira, Santorini, Greece, June 2010, pp. 30–39.
-  K. Birman, G. Chockler, and R. van Renesse, “Toward a cloud computing research agenda,” SIGACT News, vol. 40, no. 2, pp. 68–80, June 2009.
-  E. Bortnikov, “Open-source grid technologies for web-scale computing,” SIGACT News, vol. 40, no. 2, pp. 87–93, June 2009.
-  M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Job scheduling for multi-user mapreduce clusters,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-55, April 2009. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-55.html
-  ——, “Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling,” in Proceedings of the 5th European conference on Computer systems (EuroSys 2010), Paris, France, April 2010, pp. 265–278.
-  J. Tan, X. Meng, and L. Zhang, “Coupling task progress for mapreduce resource-aware scheduling,” in Proceedings of the IEEE INFOCOM 2013, Turin, Italy, April 2013, pp. 1618–1626.
-  S. Seo, I. Jang, K. Woo, I. Kim, J.-S. Kim, and S. Maeng, “Hpmr: Prefetching and pre-shuffling in shared mapreduce computation environment,” in Proceedings of the IEEE International Conference on Cluster Computing and Workshops, 2009. (CLUSTER ’09), New Orleans, Louisiana, USA, September 2009, pp. 1–8.
-  N. Mckeown, S. Shenker, T. Anderson, L. Peterson, J. Turner, H. Balakrishnan, and J. Rexford, “Openflow: Enabling innovation in campus networks,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, pp. 69–74, January 2008.
-  “Hadoop: scalable flexible data storage and analysis,” http://www.cloudera.com/content/cloudera/en/resources/library/whitepaper/hadoop_scalable_flexible_storage_and_analysis.html, May 2010.
-  S. Narayan, S. Bailey, M. Greenway, R. Grossman, A. Heath, R. Powell, and A. Daga, “Openflow enabled hadoop over local and wide area clusters,” in Proceedings of the 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC), Salt Lake City, Utah, USA, November 2012, pp. 1625–1628.
-  Z. Guo and G. Fox, “Improving mapreduce performance in heterogeneous network environments and resource utilization,” in Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, Ontario, Canada, May 2012, pp. 714–716.
-  S. Narayan, S. Bailey, and A. Daga, “Hadoop acceleration in an openflow-based cluster,” in Proceedings of the 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC), Salt Lake City, Utah, USA, November 2012, pp. 535–538.