DeepCAS: A Deep Reinforcement Learning Algorithm for Control-Aware Scheduling
Abstract
We consider networked control systems consisting of multiple independent closed-loop control subsystems, operating over a shared communication network. Such systems are ubiquitous in cyber-physical systems, Internet of Things, and large-scale industrial systems. In many large-scale settings, the size of the communication network is smaller than the size of the system. In consequence, scheduling issues arise. The main contribution of this paper is to develop a deep reinforcement learning-based control-aware scheduling (DeepCAS) algorithm to tackle these issues. We use the following (optimal) design strategy: First, we synthesize an optimal controller for each subsystem; next, we design a learning algorithm that adapts to the chosen subsystem (plant) and controller. As a consequence of this adaptation, our algorithm finds a schedule that minimizes the control loss. We present empirical results to show that DeepCAS finds schedules with better performance than periodic ones. Finally, we illustrate that our algorithm can be used for scheduling and resource allocation in more general networked control settings than the above-mentioned one.
Deep Learning; Reinforcement Learning; Optimal Control; Networked Control Systems
I Introduction
Today’s cyber-physical systems (CPS), Internet of Things (IoT), large-scale industrial systems, and myriad real-world systems try to integrate traditional control systems with artificial intelligence to improve the overall system performance and reliability. Examples of such systems include smart grids, smart cities, city-wide vehicular traffic control, industrial process optimization, among others. In recent years, artificial intelligence (AI) has seen a resurgence as an effective model-free solution to many problems, including optimal control problems arising in the aforementioned systems. This resurgence is partly owing to the advances in computational capacities and the advent of deep neural networks for function approximation and feature extraction. Oftentimes, the use of reinforcement learning algorithms or AI in conjunction with traditional controllers reduces the complexity of system design while boosting efficiency.
The large size of systems poses tremendous challenges in resource allocation. In applications involving feedback, this challenge is exacerbated since resource allocation is required to be “control-aware”, i.e., it should reduce control loss. Typical resources, including communication channels and computational resources, do not scale with system size. In large-scale systems, controllers often rely on information collected from various sensors to make intelligent decisions. Hence, efficient information dispersion is essential for decision making over communication networks to be effective. This is a hard problem since the number of communication channels available is much smaller than what is ideally required to transfer data from sensors to controllers.
In this paper, we present a deep reinforcement learning-based scheduling algorithm for the sensor-to-controller communication in large-scale networked control systems. To illustrate our ideas, we use the system architecture illustrated in Fig. 1. Later, we will show that our ideas can be used for scheduling in more general architectures.
Fig. 1 is a simplified representation of (some) CPS and IoT systems. The system consists of $N$ independent control subsystems that communicate over a shared communication network, which contains $M$ channels. We assume that $M$ is much smaller than $N$ and, in particular, that each channel can be used to serve any communication request. Each control subsystem consists of a smart sensor, a controller, and a plant to be controlled. The feedback loops are closed over this resource-constrained communication network. For more details on the system model, we refer the reader to § II.
DeepCAS, our deep reinforcement learning-based model-free scheduling algorithm, decides which of the $N$ subsystems are allocated to the $M$ channels at a given time instant. Further, it adapts to the controller at hand and finds a schedule that minimizes the control loss. We present an implementation wherein DeepCAS obtains feedback (i.e., rewards) from smart sensors for making decisions. At each time instant, each smart sensor computes the estimate of the subsystem’s state, using Kalman Filter (I), and transmits it to the corresponding controller; see Fig. 1. The controller runs Kalman Filter (II) to estimate the subsystem’s state in the absence of transmissions. In addition to Kalman Filter (I), a smart sensor also implements a copy of Kalman Filter (II) and the control algorithm. This way the smart sensor knows the state estimate used by the controller at every time instant.
Several scheduling approaches for control systems have been proposed in the literature to determine the access order of different sensors and/or actuators; see [1] and references therein. A widely considered approach is to employ periodic schedules [2, 3, 4, 5]. However, finding such schedules for control applications may not be easy since both the period and the sequence need to be found. Further, periodic sequences may not even be optimal, in which case searching for them is futile. With a few exceptions (e.g., [6] and [7]), the determination of optimal schedules indeed requires solving a mixed-integer quadratic program, which is computationally infeasible for all but very small systems.
Our contribution is in the development of a deep reinforcement learning-based control-aware scheduling (DeepCAS) algorithm. At its heart lies the Deep Q-Network (DQN), a modern variant of Q-learning, introduced in [8]. In addition to being readily scalable, DeepCAS is completely model-free. To optimize the overall control performance, we propose the following (optimal) design of control and scheduling: In the first step, we design an optimal controller for each independent subsystem. As discussed in [9], under limited communication, the control loss has two components: (a) the best possible control loss and (b) the error due to intermittent transmissions. If $M \ge N$, then (b) vanishes. Since we are in the setting of $M < N$, the goal of the scheduler is to minimize (b). To this end, we first construct a Markov decision process (MDP). The state space of this MDP is the difference in state estimates of all controllers and sensors (obtainable from the smart sensors). Its reward is the negative of the previously mentioned (b). Since we are using DQN (reinforcement learning) to solve this MDP, we do not need the knowledge of transition probabilities. Since the goal of DQN is to find a scheduling strategy that maximizes reward, it naturally aims at minimizing (b). For more details on the MDP, the reader is referred to § III. We present empirical results showing that DeepCAS finds schedules that have lower control losses than traditional hand-tuned heuristics.
The organization of this paper is as follows: Section II introduces models and assumptions for subsystems, sensors, controllers, and network. This section also presents the primary objective of the paper and how to reach this objective via solving control design and control-aware scheduling problems separately. Section III presents an MDP associated with control-aware scheduling problems and proposes a deep reinforcement learning algorithm to solve this MDP efficiently. In Section IV, numerical examples are used to illustrate the power of our AI-based scheduling algorithm. Section V gives concluding remarks and directions for future work.
II Networked Control System: Model, Assumptions, and Goals
II-A Model for each subsystem
As illustrated in Figure 1, our networked control system consists of $N$ independent closed-loop subsystems. The feedback loop within each subsystem (plant) is closed over a shared communication network. For $i \in \{1, \ldots, N\}$, subsystem $i$ is described by
$$x^i_{k+1} = A_i x^i_k + B_i u^i_k + w^i_k, \tag{1}$$
where $A_i$ and $B_i$ are matrices of appropriate dimensions, $x^i_k$ is the state of subsystem $i$, $u^i_k$ is the control input, and $w^i_k$ is zero-mean i.i.d. Gaussian noise with covariance matrix $W_i$. The initial state of subsystem $i$, $x^i_0$, is assumed to be a Gaussian random vector with mean $\bar{x}^i_0$ and covariance matrix $P^i_0$.
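At a given time $k$, we assume that only noisy output measurements are available. We thus have:
$$y^i_k = C_i x^i_k + v^i_k, \tag{2}$$
where $C_i$ is a matrix of appropriate dimensions and $v^i_k$ is zero-mean i.i.d. Gaussian noise with covariance matrix $V_i$. All noise sources, $w^i_k$ and $v^i_k$, are independent of the initial conditions $x^i_0$.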
II-B Control architecture and loss function
Each subsystem is a stochastic linear time-invariant (LTI) system given by (1). Further, each subsystem is independently controlled, although dependencies do arise from sharing a communication network. Subsystem $i$ has a smart sensor which samples the subsystem’s output and computes an estimate of the subsystem’s state. This value is then sent to the associated controller, provided a channel is allocated to it by DeepCAS. If the controller obtains a new state estimate from the sensor, then it calculates a control command based on this state estimate. Otherwise, it calculates a control command based on its own estimate of the subsystem’s state.
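The control actions and scheduling decisions (of DeepCAS) are taken to minimize the total control loss given by
$$J = \sum_{i=1}^{N} J_i, \tag{3}$$
where $J_i$ is the control loss of subsystem $i$ and is given by
$$J_i = \mathbb{E}\Big[ (x^i_T)^\top Q^f_i x^i_T + \sum_{k=0}^{T-1} \big( (x^i_k)^\top Q_i x^i_k + (u^i_k)^\top R_i u^i_k \big) \Big], \tag{4}$$
where $Q_i$ and $Q^f_i$ are positive semidefinite matrices and $R_i$ is positive definite.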
II-C Smart sensors and preprocessing units
Within our setting, the primary role of a smart sensor is to take measurements of a subsystem’s output. Also, it plays a vital role in helping DeepCAS with scheduling decisions. It is from the smart sensors that DeepCAS gets all the necessary feedback information for scheduling. For these tasks, each smart sensor employs two Kalman filters: (1) Kalman Filter (I) is used to estimate the subsystem’s state, (2) a copy of Kalman Filter (II) is used to estimate the subsystem’s state as perceived by the controller. Note that the controller employs Kalman Filter (II). Below, we discuss them in detail.
Kalman filter (I): The sensor employs a standard Kalman filter to compute the state estimate and covariance recursively as
starting from and .
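As a rough illustration of the recursion run by the smart sensor, the following is a minimal sketch of one predict/update cycle of a textbook Kalman filter. It assumes the linear Gaussian model of § II-A; the function name `kalman_step` and the argument names (`A`, `B`, `C`, `W`, `V`) are illustrative choices, not notation taken from the paper.

```python
import numpy as np

def kalman_step(x_hat, P, u, y, A, B, C, W, V):
    """One cycle of a standard Kalman filter (illustrative sketch).

    x_hat, P : previous state estimate and error covariance
    u        : control input applied at the previous step
    y        : new noisy measurement
    A, B, C  : assumed system, input, and output matrices
    W, V     : assumed process and measurement noise covariances
    """
    # Time update: propagate estimate and covariance through the model.
    x_pred = A @ x_hat + B @ u
    P_pred = A @ P @ A.T + W
    # Measurement update: correct the prediction with the innovation.
    S = C @ P_pred @ C.T + V
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(len(x_hat)) - K @ C) @ P_pred
    return x_new, P_new
```

The sensor would call such a step once per time instant and, if scheduled, transmit the resulting estimate to its controller.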
Kalman filter (II): The controller runs a minimum mean square error (MMSE) estimator to compute estimates of the subsystem’s state as follows:
(5)  
(6) 
with .
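The controller-side logic described in § II-B can be sketched as follows. This is only a sketch under an assumption: when no transmission arrives, the controller is taken to propagate its previous estimate through the nominal model. The function name `controller_estimate` and its arguments are hypothetical.

```python
import numpy as np

def controller_estimate(x_prev, u_prev, A, B, sensor_estimate=None):
    """Estimate used by the controller at the current time (sketch).

    If the sensor was scheduled, its estimate is adopted directly.
    Otherwise (assumption) the previous estimate is propagated
    open loop through the nominal model.
    """
    if sensor_estimate is not None:
        return np.asarray(sensor_estimate, dtype=float)
    return A @ x_prev + B @ u_prev
```

Because the smart sensor runs a copy of this estimator, it can compute the gap between its own estimate and the controller's at every time instant.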
II-D Goal: minimizing the control loss
Under the assumptions presented in § II, the certainty equivalent controller is still optimal; see [9] for details. The total control loss in (4) has two components: (a) the best possible control loss and (b) the error due to intermittent communications. Hence, the problem of minimizing control loss has two separate components: (i) designing the best (optimal) controller for each subsystem and (ii) scheduling in a control-aware manner.
Component I: Controller design. The controller in feedback loop takes the following control action, , at time :
(7) 
where is the state estimate used by the controller,
(8) 
and is recursively computed as
(9) 
with initial values . Let be the state estimate of Kalman Filter (I), as employed by the sensor. We have when the sensor and controller of the feedback loop have communicated. Otherwise, is the state estimate obtained from Kalman Filter (II). The minimum value of the control loss of subsystem is given by
(10) 
where and . Note that is the communication error in subsystem . Recall that there are subsystems and communication channels. If , then for all and .
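For intuition on Component I, the following is a minimal sketch of a finite-horizon linear-quadratic design via the standard backward Riccati recursion, with the control applied as a feedback on the state estimate (certainty equivalence). The function name `lqr_gains`, the symbol names, and the sign convention `u = -L @ x_hat` are assumptions made for illustration.

```python
import numpy as np

def lqr_gains(A, B, Q, R, Qf, T):
    """Backward Riccati recursion for a horizon of T steps (sketch).

    Returns the list of feedback gains L_0, ..., L_{T-1} and the
    cost-to-go matrix at time 0. The control is assumed to be
    u_k = -L_k @ x_hat_k, where x_hat_k is the controller's estimate.
    """
    P = Qf
    gains = []
    for _ in range(T):
        # Gain for the current stage, computed from the next cost-to-go.
        L = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
        # Riccati update: P <- Q + A' P (A - B L).
        P = Q + A.T @ P @ (A - B @ L)
        gains.append(L)
    gains.reverse()  # the recursion runs backward in time
    return gains, P
```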
Component II: Control-aware scheduling. The main aim of the scheduling algorithm is to help minimize the total control loss of (3). To this end, one must minimize
(11) 
of (10) for every $i$. Note that $T$ in (11) is the control horizon. At any time $k$, the scheduler decides which among the $N$ subsystems may communicate. Note that the communication error of subsystem $i$ is zero at time $k$ when a communication channel is assigned to it at that time.
In the following section, we present a deep reinforcement learning algorithm for control-aware scheduling called DeepCAS. DeepCAS communicates only with the smart sensors. At every time instant, sensors are told if they can transmit to their associated controllers. Then, the sensors provide feedback on the scheduling decision for that stage. Note that we do not consider the overhead involved in providing feedback.
III Deep reinforcement learning for control-aware scheduling
As stated earlier, at the heart of DeepCAS lies the DQN. The DQN is a modern variant of Q-learning that effectively counters Bellman’s curse of dimensionality. Essentially, DQN or Q-learning finds a solution to an associated Markov decision process (MDP) in an iterative model-free manner. Before proceeding, let us recall the definition of an MDP. For a more detailed exposition, the reader is referred to [10]. An MDP, $\mathcal{M}$, is given by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma \rangle$, where
$\mathcal{S}$ is the state space of $\mathcal{M}$.
$\mathcal{A}$ is the set of actions that can be taken.
$\mathcal{P}$ is the transition probability, i.e., $\mathcal{P}(s, a, s')$ is the probability of transitioning to state $s'$ when action $a$ is taken at state $s$.
$r$ is the one-stage reward function, i.e., $r(s, a)$ is the reward when action $a$ is taken at state $s$.
$\gamma$ is the discount factor and $0 \le \gamma \le 1$.
Below, we state the MDP associated with our problem.

The state space consists of all possible augmented error vectors. Hence, state vector at time is given by .

Action space is . Hence, the cardinality (size), , of the action space equals .

At time , the reward is given by .

Although it would seem natural to use , we use since it hastens the rate of convergence.
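To make the size of the action space concrete, the sketch below enumerates scheduling actions under the assumption that an action is a choice of which $M$ of the $N$ subsystems transmit, with channels treated as interchangeable (as § I states that each channel can serve any request). The function name `action_space` is illustrative.

```python
from itertools import combinations
from math import comb

def action_space(n_subsystems, n_channels):
    """All ways to pick which subsystems transmit (sketch).

    Each action is a tuple of subsystem indices (0-based) that are
    granted a channel at the current time instant.
    """
    return list(combinations(range(n_subsystems), n_channels))

# Sizes for the two settings used in the experiments.
single_channel = action_space(3, 1)   # 3 actions
three_channels = action_space(6, 3)   # comb(6, 3) actions
```

Under this assumption the number of actions is the binomial coefficient `comb(N, M)`, which is also the number of outputs the Q-network needs.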
Note that the scheduler (DeepCAS) takes an action just before time $k$ and receives a reward just after time $k$, based on the transmissions at time $k$. Also, note that DeepCAS only gets nonzero rewards from non-transmitting sensors. DeepCAS is model-free. Hence, it does not need to know the transition probabilities.
Let us suppose we use a reinforcement learning algorithm, such as Q-learning, to solve this MDP. Since the learning algorithm will find policies that maximize the expected cumulative future reward, we expect to find policies that minimize scheduling effects on the entire system. This is a consequence of our above definition of the reward. In the following, we provide a brief overview of Q-learning and DQN, the reinforcement learning algorithm at the heart of our DeepCAS. Simply put, DeepCAS is a DQN solving the MDP defined above.
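DeepCAS. At any time $k$, the scheduler (DeepCAS) is interested in maximizing the expectation of the following discounted future reward:
$$R_k = \sum_{j=k}^{T} \gamma^{\,j-k} r_j,$$
where $r_j$ is the single-stage reward of the MDP defined above. Q-learning is a useful methodology to solve such problems. It is based on finding the following Q-factor for every state-action pair:
$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_k \mid s_k = s,\ a_k = a,\ \pi \right],$$
where $\pi$ is a policy that maps states to actions. The algorithm itself is based on the Bellman equation:
$$Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right].$$
Since our state space is continuous, we use a deep neural network (DNN) for function approximation. Specifically, we try to find good approximations of the Q-factors iteratively. In other words, the neural network takes a state $s$ as input and outputs $Q(s, a; \theta)$ for every possible action $a$, such that $Q(s, a; \theta) \approx Q^*(s, a)$. This deep function approximator, with weights $\theta$, is referred to as a Deep Q-Network. The Deep Q-Network is trained by minimizing a time-varying sequence of loss functions given by
$$L_k(\theta_k) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[ \big( y_k - Q(s, a; \theta_k) \big)^2 \right],$$
where $y_k = \mathbb{E}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{k-1}) \mid s, a \right]$ is the expected cost-to-go based on the latest update of the weights; $\rho$ is the behavior distribution. Training the neural network involves finding $\theta_k$, which minimizes the loss functions. Since the algorithm is online, training is done in conjunction with scheduling. At time $k$, after feedback (reward) is received, one gradient descent step can be performed using the following gradient term:
$$\nabla_{\theta_k} L_k(\theta_k) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s'}\left[ \Big( r + \gamma \max_{a'} Q(s', a'; \theta_{k-1}) - Q(s, a; \theta_k) \Big) \nabla_{\theta_k} Q(s, a; \theta_k) \right]. \tag{12}$$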
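The sampled version of the loss can be illustrated with a small sketch that computes one-step targets and the squared temporal-difference error for a batch of transitions, in the style of the DQN of [8]. The function names, the `terminal` mask, and the array layout are assumptions made for this example.

```python
import numpy as np

def td_targets(rewards, next_q, gamma, terminal):
    """One-step targets r + gamma * max_a' Q(s', a') (sketch).

    rewards  : shape (batch,), one-stage rewards
    next_q   : shape (batch, n_actions), Q-values at the next states
               computed with the previous weights
    terminal : shape (batch,), 1.0 where the horizon ended, else 0.0
    """
    return rewards + gamma * (1.0 - terminal) * next_q.max(axis=1)

def td_loss(q_taken, targets):
    """Mean squared error between targets and Q-values of taken actions."""
    return float(np.mean((targets - q_taken) ** 2))
```

In a full implementation the gradient of this loss with respect to the network weights would be obtained by backpropagation, with the targets held fixed.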
To make the algorithm implementable, we use a sample rather than compute the above expectation when updating the weights. At each time, we pick actions using the $\epsilon$-greedy approach. Specifically, we pick a random action with probability $\epsilon$, and we pick a greedy action with probability $1 - \epsilon$. This constitutes the previously mentioned behavior distribution, which governs how actions are picked. Note that a greedy action at time $k$ is one that maximizes the current Q-factor estimate at the current state. Initially it is desirable to explore, hence $\epsilon$ is set to a large value. Once the algorithm has gained some experience, it is better to exploit this experience. To accomplish this, we use an attenuating $\epsilon$.
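The action-selection rule can be sketched in a few lines. The multiplicative decay with a floor shown here is only one possible attenuation scheme and is an assumption; the names `epsilon_greedy` and `decay` are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest estimated Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)

def decay(epsilon, rate, floor):
    """Attenuate epsilon multiplicatively, never going below the floor."""
    return max(floor, epsilon * rate)
```

With `epsilon = 0.0` the rule is purely greedy, and with `epsilon = 1.0` it is purely exploratory.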
Although we train our DNN in an online manner, we do not perform a gradient descent step using (12), since it can lead to poor learning. Instead, we store the previous experiences, i.e., tuples of state, action, reward, and next state, in an experience replay memory. In other words, at each time, DeepCAS stores the latest experience into this memory, which has a fixed size. When it comes to training the neural network, it performs a single mini-batch gradient descent step. The mini-batch (of gradients) is randomly sampled from the aforementioned experience replay memory. The idea of using experience replay memory, to overcome biases and to have a stabilizing effect on algorithms, was introduced in [8].
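A fixed-size experience replay memory with uniform mini-batch sampling can be sketched as below. The class name `ReplayMemory` and the tuple layout `(state, action, reward, next_state)` are assumptions made for illustration.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of past transitions (sketch)."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest experience once it is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement, capped at the buffer size.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from past experience breaks the temporal correlation between consecutive transitions, which is the stabilizing effect mentioned above.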
DQN for control-aware scheduling
IV Numerical results
We are now ready to present our numerical results. Recall that a DQN is at the heart of our DeepCAS, which uses a deep neural network to approximate Q-factors. Before proceeding, we specify the algorithm parameters. The input to the neural network is the appended error vector. The hidden layer consists of 1024 rectifier units. The output layer is a fully connected linear layer with a single output for each of the actions. The discount factor in our Q-learning algorithm is . The size of the experience replay buffer is fixed at . The exploration parameter is initialized to , then attenuated to at the rate of . For training the neural network, we use the optimizer ADAM [11] with a learning rate of and a decay of . Note that we used the same set of parameters for all of the experiments presented below.
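The network shape described above (one hidden layer of 1024 rectifier units followed by a linear output layer with one output per action) can be sketched as a plain forward pass. The initialization scheme, the parameter dictionary, and the function names are assumptions; training with ADAM is not shown.

```python
import numpy as np

def init_q_network(state_dim, n_actions, hidden=1024, seed=0):
    """Random weights for a one-hidden-layer Q-network (sketch)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / state_dim), (state_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, np.sqrt(1.0 / hidden), (hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def q_values(params, state):
    """Forward pass: ReLU hidden layer, then a linear output layer."""
    h = np.maximum(0.0, state @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]
```

The input would be the appended error vector, and the output holds one Q-value estimate per scheduling action.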
Experiment 1: N=3, M=1, and T=. For our first experiment, we used DeepCAS to schedule one channel for a system with three subsystems, over a control horizon of length $T$. We considered three second-order single-input single-output (SISO) subsystems consisting of one stable subsystem (subsystem 2) and two unstable subsystems (subsystems 1 and 3). If there were three channels, then there would be no scheduling problem and the total optimal control loss would be . Since there is a single channel, one expects a solution to the scheduling problem to allocate it to subsystems 1 and 3 for a more substantial fraction of the time, as compared to subsystem 2. This expectation is fair since subsystems 1 and 3 are unstable while subsystem 2 is stable. Once trained, DeepCAS indeed allocates the channel to subsystem 1 for 52% of the time, to subsystem 2 for 12% of the time, and to subsystem 3 for 36% of the time.
We train DeepCAS continuously over many epochs. Each epoch corresponds to a single run of the control problem with horizon . At the start of each epoch, the initial conditions for the control problem are chosen as explained in § II. Fig. 2 illustrates the learning progress of DeepCAS. The abscissa axis of the graph represents the epoch number while the ordinate axis represents the average control loss. The plot is obtained by taking the mean of Monte Carlo runs. Since DQN is randomly initialized, scheduling decisions are poor at the beginning, and the average control loss is high. As learning proceeds, the decisions taken improve. After only epochs, DeepCAS converges to a scheduling strategy with an associated control loss of around .
Comparison to periodic scheduling. Traditionally, the problem of scheduling for control systems is solved by using control-theoretic heuristics to find periodic schedules. We do the following to compare the performance of DeepCAS to such techniques. For every fixed period length, we perform an exhaustive search among all sequences of that length to find a scheduling sequence which minimizes the control loss. The results are listed in Table I. We capped the period length, since exhaustive search is extremely time-consuming and impractical beyond the lengths listed. As illustrated in Table I, this methodology yields a minimum control loss of for . In comparison, DeepCAS finds a scheduling strategy with an associated control loss of . We conclude that, in addition to being faster, DeepCAS does not need any system specification and can schedule efficiently for very long control horizons.
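The exhaustive baseline can be sketched for the single-channel case as a brute-force search over all sequences of a given period length. The function name `best_periodic_schedule` is illustrative, and `cost_fn` is a hypothetical user-supplied evaluator (for example, a simulation that returns the control loss when the sequence is repeated over the horizon).

```python
from itertools import product

def best_periodic_schedule(n_subsystems, period, cost_fn):
    """Brute-force search over single-channel periodic schedules (sketch).

    Enumerates all n_subsystems ** period sequences of subsystem
    indices and returns the one with the lowest cost.
    """
    best_seq, best_cost = None, float("inf")
    for seq in product(range(n_subsystems), repeat=period):
        cost = cost_fn(seq)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost
```

The number of candidates grows exponentially with the period length, which is why such a search is only practical for short periods.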
Another approach to scheduling is to solve an associated mixed-integer quadratic program (MIQP); see [12] for details. Solving an MIQP accurately is computationally infeasible for all but small system sizes.
Length  Sequence  Total cost 

Experiment 2: N=6, M=3, and T=. For our second experiment, we train DeepCAS to schedule three channels for a system with six second-order SISO subsystems. With as many channels as subsystems, the total control loss equals . As before, learning is done continuously over many epochs. Fig. 3 illustrates the learning progress of DeepCAS in scheduling three channels among six subsystems. The abscissa and ordinate axes are as before. As evidenced in the figure, DeepCAS quickly finds schedules with an associated control loss of around 20. Due to the difference in scale, the control loss in Fig. 3 seems to have high variance as compared to Fig. 2.
We are unable to compare the results of Experiment 2 with any traditional heuristics. This is because traditional heuristics do not extend to the system size and control horizon considered here. Further, performing an exhaustive search for periodic schedules is not possible since the number of possibilities grows exponentially with the period length.
Scheduling in general control settings. The systems considered hitherto have independent subsystems. This facilitates the splitting of the total control cost into two components; see (10). The one-stage reward in our algorithm is the negative of the error due to lack of communication defined in (11). However, in general multi-agent settings, the previously mentioned splitting may not be possible.
Now, we discuss an extension to our algorithm which will allow for scheduling in these general scenarios. For such an extension, we observe that one merely needs to provide a recipe for obtaining the state space and reward in the definition of the associated MDP. Since we want our scheduler to reduce the control loss, a natural choice for the reward is the negative of the one-stage control cost. Regarding the state space, one may need to consider the pairs of state (or its estimate) and control action of the system. It may also be necessary to append the pairs of all the subsystems. The expected control cost associated with taking a given action while in some state does not change with time. This allows for a consistent definition of the reward. It may be noted that there may be other ways to define an associated MDP.
To show that the above extension is viable, we repeated Experiments 1 and 2 with the negative of the one-stage control cost as the reward. The results of the modified experiments are very similar to the original ones. Due to constraints in space, we only present Fig. 4 associated with the modification of Experiment 1.
V Conclusions
This paper considered the problem of scheduling the sensor-to-controller communication in a networked control system, consisting of a multitude of independent control subsystems, in the presence of scarce communication resources. To this end, we developed DeepCAS, i.e., a reinforcement learning-based control-aware scheduling algorithm. This algorithm is model-free and scalable, and it outperforms scheduling heuristics tailored for feedback control applications. Specifically, we compared our algorithm with periodic schedules. Also, we briefly discussed how DeepCAS could be extended to a networked control system containing coupled subsystems.
As stated earlier, this paper does not address the problem associated with overheads. An exciting future direction could be the development of sophisticated reinforcement learning algorithms that reduce overhead. Addressing this issue may lead to the associated problem of large discrete action spaces. The algorithms thus developed would need to take this into account. Lastly, it is also interesting to consider imperfect communication scenarios.
References
 [1] P. Park, S. C. Ergen, C. Fischione, C. Lu, and K. H. Johansson, “Wireless network design for control systems: a survey,” IEEE Communications Surveys & Tutorials, 2018.
 [2] H. Rehbinder and M. Sanfridson, “Scheduling of a limited communication channel for optimal control,” Automatica, vol. 40, no. 3, pp. 491–500, March 2004.
 [3] D. Hristu-Varsakelis and L. Zhang, “LQG control of networked control systems,” International Journal of Control, vol. 81, no. 8, pp. 1266–1280, 2008.
 [4] L. Shi, P. Cheng, and J. Chen, “Optimal periodic sensor scheduling with limited resources,” IEEE Transactions on Automatic Control, vol. 56, no. 9, pp. 2190–2195, 2011.
 [5] L. Orihuela, A. Barreiro, F. Gómez-Estern, and F. R. Rubio, “Periodicity of Kalman-based scheduled filters,” IEEE Transactions on Automatic Control, vol. 50, no. 10, pp. 2672–2676, 2014.
 [6] L. Zhao, W. Zhang, J. Hu, A. Abate, and C. J. Tomlin, “On the optimal solutions of the infinite-horizon linear sensor scheduling problem,” IEEE Transactions on Automatic Control, vol. 59, no. 10, pp. 2825–2830, March 2014.
 [7] Y. Mo, E. Garone, and B. Sinopoli, “On infinite-horizon sensor scheduling,” Systems & Control Letters, vol. 67, pp. 65–70, May 2014.
 [8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” in NIPS Deep Learning Workshop, 2013.
 [9] B. Demirel, A. S. Leong, V. Gupta, and D. E. Quevedo, “Tradeoffs in stochastic event-triggered control,” arXiv:1708.02756, 2017.
 [10] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
 [11] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the International Conference on Learning Representations, 2015.
 [12] T. Charalambous, A. Ozcelikkale, M. Zanon, P. Falcone, and H. Wymeersch, “On the resource allocation problem in wireless networked control systems,” in Proceedings of the IEEE Conference on Decision and Control, 2017.