A Deep Reinforcement Learning Approach to Multi-Component Job Scheduling in Edge Computing
Abstract
We are interested in the optimal scheduling of a collection of multi-component application jobs in an edge computing system that consists of geo-distributed edge computing nodes connected through a wide area network. The scheduling and placement of application jobs in an edge system is challenging due to the interdependence of the multiple components of each job, the communication delays between the geographically distributed data sources and edge nodes, and the dynamic availability of those nodes. In this paper we explore the feasibility of applying a Deep Reinforcement Learning (DRL) based design to address these challenges. We introduce a DRL actor-critic algorithm that aims to find an optimal scheduling policy to minimize average job slowdown in the edge system. We demonstrate through simulations, based on both synthetic data and a Google cloud data trace, that our design outperforms a few existing algorithms.
I Introduction
Edge computing is an emerging computing paradigm that can reduce the network transmission cost of the large amount of data generated by edge devices (e.g., IoT sensors, mobile devices) and provide low-latency service to end users by processing computing jobs at the edge, close to those users [1, 2, 3, 4]. Edge computing is well suited for bandwidth-hungry, latency-sensitive, and location-aware applications, e.g., real-time monitoring, cognitive assistance, and self-driving vehicles.
An important issue in edge computing is the allocation of resources to the competing demands of different applications [5, 6, 7, 8, 9, 10, 11, 12]. Existing studies usually formulate edge resource allocation as optimization problems and, because these problems are hard, design heuristics to solve them. The proposed heuristics are usually applicable only to particular types of edge computing environments. It is difficult to develop accurate models to solve resource allocation and job scheduling problems in edge systems due to the complexity of those systems.
Typically an edge application job consists of multiple task components that need to be processed at different edge computing nodes that are geo-distributed [6, 13]. For example, an environmental scientist may need to compare the environmental statistics of several geo-locations. A smart city’s traffic control system needs to compare real-time statistics on the automobiles in a number of places in the city; in this case a video analysis component running at an edge node needs to process the data collected on-site and then send its result to another edge node or the cloud for further analysis.
In this paper, we explore the feasibility of applying Deep Reinforcement Learning [14] techniques to the scheduling/placement of a collection of application jobs, each consisting of multiple components to be executed on different edge nodes that are geo-distributed far apart. We model the scheduling problem as a Markov Decision Process (MDP) [15], and introduce a Deep Reinforcement Learning (DRL) algorithm to find a good scheduling policy in an edge computing system. DRL has been applied recently in distributed systems and computer networks [16, 17, 18, 19, 20]. Instead of formulating problems with accurate mathematical models, DRL approximates complex stochastic dynamic systems with Deep Neural Networks (DNNs) and learns to make scheduling decisions from its experience. In the DRL literature, there are primarily two kinds of widely used methods: value-based and actor-based [14]. Our design leverages an actor-based method that uses the Deep Deterministic Policy Gradient (DDPG) method [21] to obtain the advantages of both value-based and actor-based methods.
Our main contributions are as follows:

We model the scheduling of a collection of multi-component application jobs in an edge computing system as an MDP problem, and introduce a Deep Reinforcement Learning (DRL) model to solve the problem. Our DRL model captures the status of the edge nodes’ resource allocation to the various components of different jobs, and the status of data transmissions between edge nodes. Our DRL policy training algorithm utilizes the recent DRL actor-critic method to find a near-optimal online scheduling policy.

Through simulations with synthetic data and a real-world Google cloud data trace, we show that our proposed DRL design outperforms several existing algorithms, including Shortest Job First (SJF), Least Bytes First (LBF), and a random scheduling algorithm.
II Related Work
Resource allocation and job scheduling are fundamental problems in edge computing [5, 6, 7, 8, 9, 10, 11, 13, 22, 23, 24]. Existing work has shown the importance and challenges of designing efficient application job assignment and scheduling, as many edge applications need coordinated computation and communication between multiple components (e.g., containers, microservices) assigned and scheduled on geo-distributed edge nodes.
In addition, Deep Reinforcement Learning has recently been applied to resource allocation problems in distributed systems. Mao et al. [16, 17] leverage DRL techniques to solve a multi-resource cluster scheduling problem, and they find that their RL agents can sometimes perform better than heuristics. [25] presents an adaptive method that uses a sequence-to-sequence model to optimize device placements. [18] investigates a model-free DRL method to minimize the average end-to-end tuple processing time. [19] develops a DRL-based framework, DeepConf, to automatically learn and implement a wide range of data center networking techniques that replace existing heuristic-based approaches. [20] studies the problem of cooperative decision making by a collection of agents. [26] studies stream processing in a cloud data center, whereas our work studies edge computing applications for which there are communication delays between the different edge computing nodes that work on the different components of a job. In addition, our system design is not restricted to jobs with DAG structures [26].
III System Model
III-A Preliminaries
We consider an edge computing system that consists of edge nodes that are geo-distributed and connected through a communication network, and a cloud data center that can conduct aggregate data analytics and help with the communication between edge nodes. Let denote the set of edge nodes. Each edge node is a cluster of computing servers that are co-located with a base station or placed in a physical location specified by the edge computing system (e.g., a community center, a library). A data communication network is formed between the edge nodes of the system.
The edge system serves the application jobs submitted by many clients. Each job has multiple task components. Different task components of a job may be executed by the system at different locations and at different times. There is an execution order among task components of a job, i.e., a component that follows some other component of the same job cannot start unless the other component has finished execution.
Assume that a job scheduler resides on the cloud part of the edge system. As soon as a job arrives at the system, its metadata (such as its number of components and where they are to be executed) is submitted to the scheduler. The scheduler then uses a placement and scheduling policy to schedule the various components of the submitted application jobs on different edge nodes to optimize some system performance metrics.
We discretize the system into time slots or steps with . Let denote the set of application jobs. Consider a job (i.e., ). Let denote the set of all task components of job , for which component should be executed before . Let denote the time slot in which job arrives at the system.
A task component is executed at an edge node that is close to its data input (e.g., from real-time video surveillance systems, or IoT sensors in a smart city). Transferring a task component to the cloud for execution is not an option due to the high network transmission cost. Let denote the set of edge nodes that are able to execute task component of application job , with .
A characteristic of the system we study is the interdependence and execution ordering between an application job’s multiple task components. Once a job ’s component finishes its execution at an edge node , it might send data/result to the next component of the same job, denoted by at another edge node . The delay of transmitting the data can be denoted as , where and represent respectively the size (in bytes) of the transmitted data and the bandwidth of the link between the two edge nodes and (the link can be a physical link or an overlay link). Figure 1 shows an example application job that consists of three components.
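As a minimal sketch of the delay model above, the transfer delay is the data size divided by the link bandwidth. The function name, the units, and the rounding up to whole time slots are assumptions made for illustration; they are not part of the original formulation.

```python
import math


def transmission_delay_slots(data_bytes: float, bandwidth_bytes_per_slot: float) -> int:
    """Delay, in whole time slots, of sending data over one link (hypothetical helper)."""
    if bandwidth_bytes_per_slot <= 0:
        raise ValueError("bandwidth must be positive")
    # size / bandwidth, rounded up because time is discretized into slots (assumption)
    return math.ceil(data_bytes / bandwidth_bytes_per_slot)
```

For example, 10 bytes over a link that carries 4 bytes per slot would occupy 3 slots under this rounding assumption.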
In addition, let denote the computation capacity of edge node . This capacity can be understood as the number of computing servers at the edge node, or the number of parallel executing cores at the node. We require that at each time slot , the total consumed computation resource of all task components executed concurrently at edge node should not exceed the computation capacity .
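The capacity requirement can be illustrated with a small check over the components scheduled on a single edge node. This is a sketch only: the tuple layout `(start_slot, duration_slots, demand)` and the function name are hypothetical.

```python
from collections import defaultdict


def respects_capacity(components, capacity):
    """Return True if, in every time slot, total demand stays within capacity.

    components: iterable of (start_slot, duration_slots, demand) tuples for
    the task components scheduled on one edge node (hypothetical layout).
    """
    usage = defaultdict(int)  # time slot -> resources consumed in that slot
    for start, duration, demand in components:
        for t in range(start, start + duration):
            usage[t] += demand
    return all(used <= capacity for used in usage.values())
```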
III-B System Optimization Formulation
The objective of the edge system is to assign task components of all application jobs to edge nodes and use some scheduling policy to schedule the task components on all edge nodes collectively in order to minimize the total or average application slowdown. In this paper, we consider a central scheduler to schedule all application jobs submitted to the edge system, and leave the design of distributed schedulers as future work.
Let represent the edge node that executes task component of job , and . Let denote the assignment of component on an edge node . Here means that is assigned to ; otherwise. And denotes the scheduled starting time of component on node .
Let denote time duration needed for edge node to work on task component . If component is scheduled to be executed on edge node , the time needed to finish is given by .
The total time that a job stays in the system is given by
(1) 
with and . Let represent the ideal execution time of job without any delay due to blocking by other jobs; the slowdown of job is then defined as .
The edge system’s scheduler attempts to find a placement policy for the components of all jobs and, at the same time, a scheduling policy , where , and , in order to minimize the aggregated slowdown of all jobs. That is, the system attempts to solve the following optimization problem.
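The slowdown computation can be sketched as follows. The sketch assumes that slowdown is the ratio of a job's total time in the system to its ideal execution time, and that the ideal time is the sum of the components' time requirements plus the transmission delays, as in the later description of SJF; the function and parameter names are hypothetical.

```python
def job_slowdown(arrival_slot, finish_slot, component_durations, transfer_delays):
    """Slowdown of one job: time in system divided by ideal time (assumed form)."""
    # Ideal time: no waiting caused by other jobs, only execution and transfers.
    ideal = sum(component_durations) + sum(transfer_delays)
    time_in_system = finish_slot - arrival_slot
    return time_in_system / ideal
```

Under this assumption a job that is never blocked has a slowdown of 1, and any blocking pushes the value above 1.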
(2) 
subject to the following constraints: (1) All components of a job should follow their execution ordering. That is, a component’s start time should be no earlier than the completion time of a previous component plus the data transmission delay from the previous component to this component, i.e., . (2) A task component can only be placed on one edge node and the edge node must be one of the multiple edge nodes that are eligible to execute the component due to edge computing requirement. (3) The resource requirements of all components (which might be from different jobs) that are scheduled to be executed concurrently at an edge node should not exceed the capacity of the edge node. Due to space limitations, a formal and complete optimization formulation is not presented here.
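Constraint (1) can be checked for a single job with a few lines of code. This is an illustrative sketch that assumes a linear chain of components; the list-based layout and the function name are hypothetical.

```python
def respects_ordering(starts, durations, delays):
    """Check the execution-order constraint for one job's component chain.

    starts[k] and durations[k] describe component k; delays[k] is the data
    transmission delay from component k to component k + 1.
    """
    for k in range(len(starts) - 1):
        # Earliest legal start of the next component: completion of this
        # component plus the transfer delay to the next one.
        earliest_next = starts[k] + durations[k] + delays[k]
        if starts[k + 1] < earliest_next:
            return False
    return True
```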
In this paper, we focus on the scheduling of the task components of all jobs at edge nodes and assume that the placement of the components at edge nodes is known and fixed. Thus for presentation clarity, we will omit edge node subscripts in the remainder of the paper when applicable (e.g., using instead of ).
III-C Dynamic Scheduling via Deep Reinforcement Learning
In practice, application jobs arrive at the system dynamically, usually following some stochastic process. We model the online job scheduling problem (2) as a Markov Decision Process (MDP). To solve the problem, we introduce a scheduling policy based on the actor-critic method of Deep Reinforcement Learning (DRL) [21, 14], in which deep neural networks are used as function approximators to estimate values or compute a policy distribution. The architecture of the method is illustrated in Figure 2. Function approximator maps a state from state space to action space. An actor deep neural network computes the function and provides an action for a given state. Then a critic network returns value for each action using the Bellman equation, parameterized by . The actor network is trained by the deterministic policy gradient method, with the help of the critic network.
We next describe our DRL-based model of the edge system’s state space, action space, and rewards, and our proposed training algorithm, which generates an online multi-component job scheduling policy based on a large number of training samples accumulated from past experience.
State Space. A state of the edge system in a time slot includes the current assignment of job components to edge nodes and the resource profiles of jobs in a job queue, and the current data transmission status of each edge node to other edge nodes. Let denote the state of the system at time , and then . Note that denotes the assignment of the components (of various jobs) in different time slots. That is, it includes the scheduling information of each job that arrives at time or earlier and for which the scheduler is able to schedule its components to start executing at certain time slots at different edge nodes. The state’s second element denotes the bandwidth usage status of the transmission links of each edge node to other edge nodes (which are used to transfer data between neighboring task components of the same job). And denotes the profiles of the jobs in a waiting queue. Furthermore denotes the number of jobs in a backlog queue. Note that encodes the interdependence and execution order of the components of each job. In order to have a fixed representation of system state, we choose a finite job queue with a size of , and leave all jobs that have arrived but have not been assigned to any edge node and are also not able to enter the queue in the backlog.
We use images to encode the state of the edge system (similar to [16]). Specifically, consists of four different types of images: edge node images for , network communication images for , waiting queue images for , and a backlog image for .

Edge node images show the assignment of task components on edge nodes. The width of an edge node image represents the amount of computation resource per time slot of that node, denoted by . We set it to be at least as large as the maximum resource requirement per time slot of a task component in our experiments. The height of an edge node image represents how far into the future we can reserve an edge node’s resources. We set an image’s height to be at least as large as the maximum time requirement (measured in time slots) of a task component. We use to denote the height of an image.

Communication images show the status of data transfers between different components (of a job) that are executed at different edge nodes. Similar to edge node images, the width of a communication image (corresponding to a data transmission link between two edge nodes) represents the bandwidth of the link. It is sufficient to set the height of a communication image to the amount of time needed to complete the largest data transfer in our experiments.

Waiting queue images encode the profiles of the jobs in a queue waiting to be scheduled. Note that a job’s profile includes its component set, which has the following information of each component: (1) its computation resource requirement per time slot; (2) the number of time slots needed for each component; (3) the edge node to which a component is assigned; (4) the execution order of all components of a job.

Backlog image records the number of jobs that are in the backlog of the system.
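The edge node image described above can be sketched as a two-dimensional grid whose rows are time slots and whose columns are resource slots. The sketch below is an assumption-laden illustration: the grid is a plain list of lists, `0` marks a free slot, job identifiers are nonzero integers, and the helper names are hypothetical.

```python
def empty_image(height, width):
    """Create an edge node image: `height` time slots by `width` resource slots."""
    return [[0] * width for _ in range(height)]


def place_component(image, job_id, start, duration, demand):
    """Reserve `demand` free resource slots in rows start..start+duration-1.

    Returns False and leaves the image unchanged if the component does not
    fit within the image height or within the free slots of any row.
    """
    if start + duration > len(image):
        return False
    rows = range(start, start + duration)
    if any(image[t].count(0) < demand for t in rows):
        return False
    for t in rows:
        remaining = demand
        for c, cell in enumerate(image[t]):
            if cell == 0 and remaining > 0:
                image[t][c] = job_id  # mark the slot with the job's "color"
                remaining -= 1
    return True
```

A communication image could be encoded the same way, with columns standing for link bandwidth instead of computation resources.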
We now illustrate our state representation via an example shown in Figure 3. In this example, there are 4 scheduled jobs with their components assigned on three edge nodes. There are two jobs in a waiting queue, and one job in a backlog queue. Different jobs are distinguished with different colors. There are three edge node images (the leftmost three images in the figure). The six images in the middle represent the data communication between these edge nodes. These images show the bandwidth between two edge nodes and the communication status of each task component. The width of each network image represents the capacity of a communication link between two edge nodes. We assume that at each time step, one edge node can only transfer one task component’s data to another edge node. The top rows of the edge node images and the communication images correspond to time slot , i.e., the current time of the system. The rightmost images show the profiles of the jobs in the waiting queue and the number of jobs in the backlog.
In Figure 3, the four jobs that have been scheduled for execution at three edge nodes in the next time steps are marked as red, yellow, blue, and gray. Take the blue-colored job as an example. The job has three task components. The first is executed in the current time slot by edge node 3 using one resource slot, and edge node 3 will transfer the data of this first task component to edge node 1 in time slot 1. Then, the second task component will be executed in time slot 2 by edge node 1 using two resource slots. Later, edge node 2 will receive the data transferred by edge node 1 in time slot 3, and will execute the third component in time slot 4 using two resource slots.
Note that a component of a job might not be able to transfer its data to the next component at some other edge node even if it is ready, because it is blocked by some other job. For example, the gray-colored job’s first task component is executed in time slot 1, but after that, it cannot immediately transfer its data from edge node 1 to edge node 3 (which will execute the second component of the same job), since edge node 1 is already scheduled to transfer the data of the first component of the yellow-colored job in time slot 2. The data of the first task component of the gray-colored job will therefore be transferred in time slot 3.
Recall that there is an execution ordering between components of a job. In the example shown in Figure 3, the numbers shown in the images of the waiting queue represent the ordering of the task components of a job. For example, the job in Job Slot 1 has two task components. The six digits of in the bottom image of job slot 1 indicate that this image is the job’s first component, and the two digits of in the top image of Job Slot 1 indicate that this image is the job’s second component.
Action Space. In each time slot , the system has jobs to schedule, which results in an action space of size . In order to reduce the large action space, we utilize a method similar to [16, 18]. At each time , the scheduler’s action consists of multiple sub-actions . Each sub-action is to choose a job in the queue . If , the scheduler will try to schedule the th job in the queue. We freeze the system time until the scheduler encounters an invalid sub-action. An invalid sub-action can be: (1) , i.e., the scheduler does not choose any job; (2) , but the th job in the queue cannot be scheduled at due to the current status of edge nodes and/or their communication links, or the execution ordering requirement of the components of the th job. If the scheduler finds an invalid sub-action, it will collect all of its valid sub-actions to form action and the system time will move forward again.
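The sub-action procedure can be sketched as a loop that keeps picking jobs while time is frozen and stops at the first invalid pick. The callables `choose` and `can_schedule` are hypothetical stand-ins for the policy and for the feasibility check, and removing a scheduled job from the list is a simplification of the fixed-size queue.

```python
def collect_action(queue, choose, can_schedule):
    """Gather the valid sub-actions of one time slot.

    queue: list of waiting jobs (mutated: scheduled jobs are removed).
    choose(queue): returns an index into the queue, or None for "no job".
    can_schedule(job): True if the job fits the current node and link status.
    """
    action = []
    while True:
        pick = choose(queue)
        if pick is None or not (0 <= pick < len(queue)):
            break  # invalid sub-action: no job chosen
        job = queue[pick]
        if not can_schedule(job):
            break  # invalid sub-action: chosen job cannot be scheduled now
        action.append(job)
        del queue[pick]
    return action  # system time moves forward after this returns
```

This decomposition keeps the per-step choice linear in the queue size instead of requiring the policy to select a subset of jobs at once.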
Rewards. Recall that the system’s objective is to minimize the average slowdown of all jobs. In each time slot , we let reward , and contains at time the jobs in the waiting queue, the jobs in the backlog, and the jobs that have been scheduled but not completed yet.
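One reward of this kind, used in [16] for the slowdown objective, gives every job currently in the system a penalty of minus one over its ideal execution time in each time slot. Whether the reward here takes exactly that form is an assumption; the sketch below only illustrates the idea, and the function name is hypothetical.

```python
def step_reward(ideal_times):
    """Per-slot reward: sum of -1 / ideal_time over all jobs in the system.

    ideal_times: ideal execution times of the jobs in the waiting queue, in
    the backlog, and scheduled but not yet completed (assumed reward form).
    """
    return sum(-1.0 / t for t in ideal_times)
```

Summed over the slots a job spends in the system, its contribution equals minus its slowdown, so maximizing the cumulative undiscounted reward corresponds to minimizing total slowdown under this assumption.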
Training Algorithm to Find Scheduling Policy. Our algorithm utilizes a critic network and an actor network. The critic network is used only to train the actor network, not to provide actions to schedule jobs in the waiting queue. To train the actor network and the critic network in a stable and robust way, we use an experience replay buffer , as suggested in [21]. At each time step, we store a quadruple as a sample in the replay buffer , which is finite with a fixed size . When the replay buffer is full, the oldest sample is replaced. Once training begins, a minibatch of samples is drawn from the replay buffer to update the networks.
For each quadruple sample in the minibatch, we first obtain an action for the next state , and obtain the estimated Q-value for this pair. Because the critic network is also used to calculate the estimated Q-value and the actor network is also used to obtain , training is likely to be unstable and difficult to converge. Thus we use target networks and with soft updates to robustly train these two networks. That is, we use a target actor network to obtain for the next state, and calculate the estimated Q-value with a target critic network. Then we set as the sum of the immediate reward and the discounted Q-value as follows: . In our experiments, . Then the critic network is updated by minimizing a common loss,
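A fixed-size replay buffer with oldest-first replacement can be sketched with a bounded deque. The class and method names are hypothetical, and uniform random sampling of the minibatch is an assumption taken from common practice.

```python
import random
from collections import deque


class ReplayBuffer:
    """Finite buffer of (state, action, reward, next_state) quadruples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest sample when a new one is
        # appended to a full buffer.
        self.samples = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.samples.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a minibatch uniformly at random without replacement."""
        k = min(batch_size, len(self.samples))
        return random.sample(list(self.samples), k)

    def __len__(self):
        return len(self.samples)
```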
(3) 
The actor network is updated using sampled policy gradient,
(4) 
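For reference, the target value, critic loss, and actor gradient in the DDPG method of [21] take the standard form below. The symbols follow the notation of [21] and are assumed here: $N$ is the minibatch size, $\mu$ and $Q$ are the actor and critic networks with parameters $\theta^{\mu}$ and $\theta^{Q}$, primes mark the target networks, and $\gamma$ is the discount factor. They may differ from the notation used in this paper's equations.

```latex
\begin{align*}
y_i &= r_i + \gamma \, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right), \\
L &= \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2, \\
\nabla_{\theta^{\mu}} J &\approx \frac{1}{N} \sum_{i}
  \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)} \,
  \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s = s_i}.
\end{align*}
```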
The target actor network and the target critic network are updated softly,
(5)  
(6) 
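The target computation and the soft update can be sketched on plain lists of numbers. The rule used below, in which each target parameter moves a fraction `tau` toward the online parameter, is the one given in [21]; the function names, the flat-list parameter layout, and the name `tau` are assumptions for illustration.

```python
def td_target(reward, next_q, gamma):
    """Immediate reward plus the discounted Q-value from the target critic."""
    return reward + gamma * next_q


def soft_update(target_params, online_params, tau):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return [
        tau * w + (1.0 - tau) * w_target
        for w, w_target in zip(online_params, target_params)
    ]
```

With a small `tau` the target networks change slowly, which is what keeps the regression target for the critic from shifting abruptly between updates.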
IV Evaluation
We next present some of our simulation results to compare our DRL scheduling policy’s performance with some other scheduling policies.
IV-A Experiment Setup
We first generate two synthetic datasets to evaluate the performance of our DRL scheduling policy. Dataset 1 has only two types of jobs, long jobs and short jobs. Each task component of a long job has a time requirement of 9 time slots and a computation resource requirement of 9 slots per time slot; for a short job’s task component, these two requirements are 8 and 18 respectively. For dataset 2, we randomize the time requirement and the resource requirement per time slot of both long and short jobs’ task components, but we make sure that the long jobs have longer time requirements (i.e., larger ) than the short jobs. The edge node images used by the DRL agents to run the simulations of dataset 1 and dataset 2 have resource slots and time slots.
We also run simulations with a dataset from the Google cloud trace [27]. We derive a task component’s time duration from its SCHEDULE flag and its FINISH flag. A component’s resource requirement per time slot is derived from the CPU and memory capacity that the component uses during its execution. We set the edge node images in the simulation of the Google dataset to have resource slots per time slot and .
We evaluate our DRL scheduling policy at different levels of average edge node workload . The workload of an edge node includes both computation workload and network transmission workload. The computation workload is calculated as the ratio of the total resources needed by all jobs in an episode to the total available resources of the edge nodes aggregated over all time slots in the episode. The network workload is calculated similarly. For the simulations of datasets 1 and 2, we let vary from to based on job arrival rates and . In our simulation based on the Google dataset, we let vary from to based on job arrival rates and .
We are interested in two performance metrics: average job completion time (i.e., ), and average job slowdown . We compare our DRL scheduling policy with three other algorithms: Least Bytes First (LBF), Shortest Job First (SJF), and random scheduling. LBF gives the highest scheduling priority to the job with the least total bytes. The total bytes of a job is calculated as the sum of all of its components’ computation sizes and all of its transmitted data sizes. Note that the computation size of a component is the product of the number of resource slots it consumes per time slot and the number of time slots it needs. SJF gives the highest priority to the job with the least . Recall that of a job includes only the total number of time slots needed by all of the job’s components and the transmission delays of the job.
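The computation workload ratio can be illustrated with a short calculation. The `(resource_per_slot, duration_slots)` layout and the function name are hypothetical; the network workload would be computed analogously from link bandwidths and transfer sizes.

```python
def computation_workload(component_demands, node_capacities, num_slots):
    """Ratio of resources needed by all jobs to resources available in an episode.

    component_demands: (resource_per_slot, duration_slots) for every task
    component of every job in the episode (hypothetical layout).
    node_capacities: per-slot computation capacity of each edge node.
    """
    needed = sum(r * d for r, d in component_demands)
    available = sum(node_capacities) * num_slots
    return needed / available
```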
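The two baseline priority rules can be sketched as sort keys over a simple job record. The dictionary layout and function names are hypothetical; the quantities follow the definitions in the paragraph above.

```python
def total_bytes(job):
    """LBF key: computation sizes of all components plus transmitted data sizes."""
    compute = sum(r * d for r, d in job["components"])  # resource slots x time slots
    return compute + sum(job["transfer_sizes"])


def ideal_time(job):
    """SJF key: time slots needed by all components plus transmission delays."""
    return sum(d for _, d in job["components"]) + sum(job["transfer_delays"])


def lbf_order(jobs):
    """Jobs sorted with the least total bytes first."""
    return sorted(jobs, key=total_bytes)


def sjf_order(jobs):
    """Jobs sorted with the shortest ideal time first."""
    return sorted(jobs, key=ideal_time)
```

The two rules can disagree: a job with one wide but brief component is early under SJF and late under LBF.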
IV-B Results
Figure 4 shows an example of the convergence of our DRL scheduling policy during training. In the beginning, the total reward that each episode receives grows fast. Then, near episode 117, the curve begins to converge.
Figures 5, 6, and 7 illustrate the average slowdown and the average completion time of different designs based on three datasets and different average edge node load . We observe that as the workload increases, the average completion time and slowdown will also increase.
In general, our DRL scheduling policy performs best among the four scheduling policies. For example, Figure 5 shows that the average slowdown of our DRL scheduling policy is less than LBF, less than SJF, and the average completion time of our DRL policy is less than LBF, and less than SJF when . Regarding the simulations of Google data, Figure 7 shows that our DRL scheduling policy has less average slowdown than SJF and LBF, and less completion time than SJF and LBF when . Additionally, simulations based on Google data with ranging from to are also performed. Our DRL scheduling policy shows a clear advantage in average completion time and average slowdown compared with the other three scheduling policies.
V Conclusion and Future Work
We introduced a Deep Reinforcement Learning model to address the scheduling of a collection of multi-component edge application jobs. Our DRL algorithm leverages the recent actor-critic method to search for an online scheduling policy, which outperforms a few existing algorithms in our experiments with synthetic data and a Google cloud data trace. Our preliminary study in this paper has shown the feasibility of applying state-of-the-art DRL techniques to solve complex resource allocation problems in edge computing systems.
For future work, we will extend the model to include the application component placement problem in addition to scheduling. Furthermore, we will extend our centralized DRL scheduling design to a multi-agent collaborative DRL design.
References
 [1] S. Chen, T. Zhang, and W. Shi, “Fog computing,” IEEE Internet Computing, pp. 4–6, 2017.
 [2] S. Yi et al., “Lavea: Latency-aware video analytics on edge computing platform,” in ACM/IEEE SEC, 2017.
 [3] Z. Hao, E. Novak, S. Yi, and Q. Li, “Challenges and software architecture for fog computing,” IEEE Internet Computing, pp. 44–53, 2017.
 [4] E. Markakis et al., “Efficient next generation emergency communications over multi-access edge computing,” IEEE Communications Magazine, pp. 92–97, 2017.
 [5] H. Tan, Z. Han, X. Li, and F. Lau, “Online job dispatching and scheduling in edge-clouds,” in IEEE INFOCOM, 2017.
 [6] Z. Cao, H. Zhang, and B. Liu, “Performance and stability of application placement in mobile edge computing system,” in IEEE IPCCC, 2018.
 [7] Z. Cao, H. Zhang, B. Liu, and B. Sheng, “A game-theoretic framework for revenue sharing in edge-cloud computing system,” in IEEE IPCCC, 2018.
 [8] J. Xu, L. Chen, and P. Zhou, “Joint service caching and task offloading for mobile edge computing in dense networks,” in IEEE INFOCOM, 2018.
 [9] L. Wang et al., “Service entity placement for social virtual reality applications in edge computing,” in IEEE INFOCOM, 2018.
 [10] Q. Qin et al., “Sdn controller placement at the edge: Optimizing delay and overheads,” in IEEE INFOCOM, 2018.
 [11] R. Cziva et al., “Dynamic, latency-optimal vnf placement at the network edge,” in IEEE INFOCOM, 2018.
 [12] D. Zhang et al., “Heteroedge: taming the heterogeneity of edge computing system in social sensing,” in ACM IoTDI, 2019.
 [13] T. Bahreini and D. Grosu, “Efficient placement of multi-component applications in edge computing systems,” in ACM/IEEE SEC, 2017.
 [14] G. Dulac-Arnold et al., “Deep reinforcement learning in large discrete action spaces,” arXiv preprint arXiv:1512.07679, 2015.
 [15] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
 [16] H. Mao et al., “Resource management with deep reinforcement learning,” in ACM HotNets-XV, 2016.
 [17] H. Mao et al., “Neural adaptive video streaming with pensieve,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, 2017.
 [18] T. Li, Z. Xu, J. Tang, and Y. Wang, “Model-free control for distributed stream data processing using deep reinforcement learning,” Proceedings of the VLDB Endowment.
 [19] S. Salman et al., “Deepconf: Automating data center network topologies management with machine learning,” in Proceedings of the 2018 Workshop on Network Meets AI & ML, 2018.
 [20] C. Zhang et al., “A cooperative multi-agent deep reinforcement learning framework for real-time residential load scheduling,” in ACM IoTDI, 2019.
 [21] T. Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [22] T. He et al., “It’s hard to share: Joint service placement and request scheduling in edge clouds with sharable and non-sharable resources,” in IEEE ICDCS, 2018, pp. 365–375.
 [23] M. Abranches, S. Goodarzy, M. Nazari, S. Mishra, and E. Keller, “Shimmy: Shared memory channels for high performance inter-container communication,” in USENIX HotEdge 19, 2019.
 [24] N. Talagala, S. Sundararaman, V. Sridhar, D. Arteaga, Q. Luo, S. Subramanian, S. Ghanta, L. Khermosh, and D. Roselli, “ECO: Harmonizing edge and cloud with ml/dl orchestration,” in USENIX HotEdge 18, 2018.
 [25] A. Mirhoseini et al., “Device placement optimization with reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning, 2017.
 [26] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh, “Learning scheduling algorithms for data processing clusters,” in ACM SIGCOMM, 2019.
 [27] Google. Google cloud cluster data. http://googleresearch.blogspot.com/2010/01/googleclusterdata.html.