DeepCAS: A Deep Reinforcement Learning Algorithm for Control-Aware Scheduling
We consider networked control systems consisting of multiple independent closed-loop control subsystems operating over a shared communication network. Such systems are ubiquitous in cyber-physical systems, the Internet of Things, and large-scale industrial systems. In many large-scale settings, the size of the communication network is smaller than the size of the system, and scheduling issues arise as a consequence. The main contribution of this paper is to develop a deep reinforcement learning-based control-aware scheduling (DeepCAS) algorithm to tackle these issues. We use the following (optimal) design strategy: first, we synthesize an optimal controller for each subsystem; next, we design a learning algorithm that adapts to the chosen subsystems (plants) and controllers. As a consequence of this adaptation, our algorithm finds a schedule that minimizes the control loss. We present empirical results showing that DeepCAS finds schedules with better performance than periodic ones. Finally, we illustrate that our algorithm can be used for scheduling and resource allocation in networked control settings more general than the one above.
Deep Learning; Reinforcement Learning; Optimal Control; Networked Control Systems
Today’s cyber-physical systems (CPS), Internet of Things (IoT), large-scale industrial systems, and myriad real-world systems try to integrate traditional control systems with artificial intelligence to improve the overall system performance and reliability. Examples of such systems include smart grids, smart cities, city-wide vehicular traffic control, industrial process optimization, among others. In recent years, artificial intelligence (AI) has seen a resurgence as an effective model-free solution to many problems, including optimal control problems arising in the aforementioned systems. This resurgence is partly owing to the advances in computational capacities and the advent of deep neural networks for function approximation and feature extraction. Oftentimes, the use of reinforcement learning algorithms or AI in conjunction with traditional controllers reduces the complexity of system design while boosting efficiency.
The large size of systems poses tremendous challenges in resource allocation. In applications involving feedback, this challenge is exacerbated since resource allocation is required to be "control-aware", i.e., it should reduce the control loss. Typical resources, including communication channels and computational resources, do not scale with system size. In large-scale systems, controllers often rely on information collected from various sensors to make intelligent decisions. Hence, efficient information dispersion is essential for decision making over communication networks to be effective. As noted earlier, this is a hard problem since the number of communication channels available is much smaller than what is ideally required to transfer data from sensors to controllers.
In this paper, we present a deep reinforcement learning-based scheduling algorithm for the sensor-to-controller communication in large-scale networked control systems. To illustrate our ideas, we use the system architecture illustrated in Fig. 1. Later, we will show that our ideas can be used for scheduling in more general architectures.
Fig. 1 is a simplified representation of (some) CPS and IoT systems. The system consists of N independent control subsystems that communicate over a shared communication network, which contains M channels. We assume that M is much smaller than N and, in particular, that each channel can be used to serve any communication request. Each control subsystem consists of a smart sensor, a controller, and a plant to be controlled. The feedback loops are closed over this resource-constrained communication network. For more details on the system model, we refer the reader to § II.
DeepCAS, our deep reinforcement learning-based model-free scheduling algorithm, decides which of the subsystems are allocated to the channels at a given time instant. Further, it adapts to the controller at hand and finds a schedule that minimizes the control loss. We present an implementation wherein DeepCAS obtains feedback (i.e., rewards) from the smart sensors for making decisions. At each time instant, each smart sensor computes an estimate of the subsystem's state, using Kalman Filter (I), and transmits it to the corresponding controller; see Fig. 1. The controller runs Kalman Filter (II) to estimate the subsystem's state in the absence of transmissions. In addition to Kalman Filter (I), a smart sensor also implements a copy of Kalman Filter (II) and the control algorithm. This way, the smart sensor knows the state estimate used by the controller at every time instant.
Several scheduling approaches for control systems have been proposed in the literature to determine the access order of different sensors and/or actuators; see [1] and references therein. A widely considered approach is to employ periodic schedules [2, 3, 4, 5]. However, finding such schedules for control applications may not be easy, since both a specific period and a sequence need to be found. Further, periodic sequences may not even be optimal, in which case searching for them is futile. With a few exceptions (e.g., [6] and [7]), the determination of optimal schedules requires solving a mixed-integer quadratic program, which is computationally infeasible for all but very small systems.
Our contribution is the development of a deep reinforcement learning-based control-aware scheduling (DeepCAS) algorithm. At its heart lies the Deep Q-Network (DQN), a modern variant of Q-learning, introduced in [8]. In addition to being readily scalable, DeepCAS is completely model-free. To optimize the overall control performance, we propose the following (optimal) design of control and scheduling: In the first step, we design an optimal controller for each independent subsystem. As discussed in [9], under limited communication the control loss has two components: (a) the best possible control loss and (b) the error due to intermittent transmissions. If M = N, then (b) vanishes. Since we are in the setting M < N, the goal of the scheduler is to minimize (b). To this end, we first construct a Markov decision process (MDP). The state space of this MDP is the difference between the state estimates of all controllers and sensors (obtainable from the smart sensors). Its reward is the negative of the previously mentioned (b). Since we use DQN (reinforcement learning) to solve this MDP, we do not need knowledge of the transition probabilities. Since the goal of DQN is to find a scheduling strategy that maximizes the reward, it naturally minimizes (b). For more details on the MDP, the reader is referred to § III. We present empirical results showing that DeepCAS finds schedules with lower control losses than traditional hand-tuned heuristics.
The organization of this paper is as follows: Section II introduces models and assumptions for subsystems, sensors, controllers, and network. This section also presents the primary objective of the paper and how to reach this objective via solving control design and control-aware scheduling problems separately. Section III presents an MDP associated with control-aware scheduling problems and proposes a deep reinforcement learning algorithm to solve this MDP efficiently. In Section IV, numerical examples are used to illustrate the power of our AI-based scheduling algorithm. Section V gives concluding remarks and directions for future work.
II Networked Control System: Model, Assumptions, and Goals
II-A Model for each subsystem
As illustrated in Figure 1, our networked control system consists of N independent closed-loop subsystems. The feedback loop within each subsystem (plant) is closed over a shared communication network. For i = 1, …, N, subsystem i is described by

x_i(k+1) = A_i x_i(k) + B_i u_i(k) + w_i(k),     (1)

where A_i and B_i are matrices of appropriate dimensions, x_i(k) is subsystem i's state, u_i(k) is the control input, and w_i(k) is zero-mean i.i.d. Gaussian noise with covariance matrix W_i. The initial state of subsystem i, x_i(0), is assumed to be a Gaussian random vector with mean m_i and covariance matrix P_i(0).
At a given time k, we assume that only noisy output measurements are available. We thus have

y_i(k) = C_i x_i(k) + v_i(k),     (2)

where v_i(k) is zero-mean i.i.d. Gaussian noise with covariance matrix V_i. All noise sources, w_i(k) and v_i(k), are independent of the initial conditions x_i(0).
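The model above can be made concrete with a minimal simulation sketch. Everything here is illustrative rather than taken from the paper: we use a scalar subsystem (so A, B, C and the noise variances W, V are plain numbers, with values chosen arbitrarily), and we run it open loop.

```python
import random

# Hypothetical scalar parameters for one subsystem (not from the paper):
A, B, C = 1.2, 1.0, 1.0   # dynamics, input, and output coefficients
W, V = 0.1, 0.1           # process / measurement noise variances

def step(x, u, rng):
    """One step of the dynamics x(k+1) = A x(k) + B u(k) + w(k)."""
    return A * x + B * u + rng.gauss(0.0, W ** 0.5)

def measure(x, rng):
    """Noisy output y(k) = C x(k) + v(k)."""
    return C * x + rng.gauss(0.0, V ** 0.5)

rng = random.Random(0)
x = rng.gauss(0.0, 1.0)        # Gaussian initial state
outputs = []
for k in range(10):
    outputs.append(measure(x, rng))
    x = step(x, 0.0, rng)      # open loop, for illustration only
```

Since A > 1 here, the uncontrolled state grows over time, which is exactly why such a subsystem competes harder for the communication channel later on.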
II-B Control architecture and loss function
The dynamics of each subsystem are given by the stochastic linear time-invariant (LTI) model (1). Further, each subsystem is independently controlled, although dependencies do arise from sharing a communication network. Subsystem i has a smart sensor, which samples the subsystem's output and computes an estimate of the subsystem's state. This value is then sent to the associated controller, provided a channel is allocated to it by DeepCAS. If the controller obtains a new state estimate from the sensor, then it calculates a control command based on this state estimate. Otherwise, it calculates a control command based on its own estimate of the subsystem's state.
The control actions and scheduling decisions (of DeepCAS) are taken to minimize the total control loss

J = Σ_{i=1}^{N} J_i,     (3)

where J_i is the control loss of subsystem i and is given by

J_i = E[ x_i(T)' Q_i^f x_i(T) + Σ_{k=0}^{T-1} ( x_i(k)' Q_i x_i(k) + u_i(k)' R_i u_i(k) ) ],     (4)

where Q_i and Q_i^f are positive semi-definite matrices and R_i is positive definite.
II-C Smart sensors and pre-processing units
Within our setting, the primary role of a smart sensor is to take measurements of a subsystem's output. It also plays a vital role in helping DeepCAS with scheduling decisions: it is from the smart sensors that DeepCAS gets all the feedback information necessary for scheduling. For these tasks, each smart sensor employs two Kalman filters: (1) Kalman Filter (I), used to estimate the subsystem's state, and (2) a copy of Kalman Filter (II), used to estimate the subsystem's state as perceived by the controller. Note that the controller itself employs Kalman Filter (II). Below, we discuss them in detail.
Kalman filter (I): The sensor employs a standard Kalman filter to compute the state estimate x̂_i^s(k) and error covariance P_i(k) recursively as

x̂_i^s(k|k-1) = A_i x̂_i^s(k-1) + B_i u_i(k-1),
P_i(k|k-1) = A_i P_i(k-1) A_i' + W_i,
K_i(k) = P_i(k|k-1) C_i' ( C_i P_i(k|k-1) C_i' + V_i )^{-1},
x̂_i^s(k) = x̂_i^s(k|k-1) + K_i(k) ( y_i(k) - C_i x̂_i^s(k|k-1) ),
P_i(k) = ( I - K_i(k) C_i ) P_i(k|k-1),

starting from the mean m_i and covariance P_i(0) of the initial state.
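For the scalar subsystem sketched earlier, the Kalman filter recursion reduces to a few lines. This is a minimal sketch of the standard predict/update cycle, with all quantities scalar; the function names and argument order are our own choices, not an API from the paper.

```python
def kalman_predict(xhat, P, A, B, u, W):
    """Time update: propagate the estimate and covariance through the model."""
    return A * xhat + B * u, A * P * A + W

def kalman_update(xpred, Ppred, y, C, V):
    """Measurement update with Kalman gain K."""
    K = Ppred * C / (C * Ppred * C + V)
    return xpred + K * (y - C * xpred), (1.0 - K * C) * Ppred
```

Note that with a near-perfect measurement (V close to zero) the gain K approaches 1/C and the updated estimate snaps to the measurement, as expected.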
Kalman filter (II): The controller runs a minimum mean square error (MMSE) estimator to compute estimates x̂_i^c(k) of the subsystem's state as follows: if the sensor's estimate is received at time k, the controller sets x̂_i^c(k) = x̂_i^s(k); otherwise, it propagates its previous estimate through the noise-free model,

x̂_i^c(k) = A_i x̂_i^c(k-1) + B_i u_i(k-1).
II-D Goal: minimizing the control loss
Under the assumptions presented in § II, the certainty equivalent controller is still optimal; see [9] for details. The control loss in (4) has two components: (a) the best possible control loss and (b) the error due to intermittent communications. Hence, the problem of minimizing the control loss has two separate components: (i) designing the best (optimal) controller for each subsystem and (ii) scheduling in a control-aware manner.
Component I: Controller design. The controller in feedback loop i takes the following control action, u_i(k), at time k:

u_i(k) = -L_i(k) x̂_i^c(k),

where x̂_i^c(k) is the state estimate used by the controller,

L_i(k) = ( B_i' S_i(k+1) B_i + R_i )^{-1} B_i' S_i(k+1) A_i,

and S_i(k) is recursively computed as

S_i(k) = A_i' S_i(k+1) A_i + Q_i - A_i' S_i(k+1) B_i L_i(k),

with terminal value S_i(T) = Q_i^f. Let x̂_i^s(k) be the state estimate of Kalman Filter (I), as employed by the sensor. We have x̂_i^c(k) = x̂_i^s(k) when the sensor and controller of feedback loop i have communicated. Otherwise, x̂_i^c(k) is the state estimate obtained from Kalman Filter (II). The minimum value of the control loss of subsystem i is given by

J_i* = J_i^min + E[ Σ_{k=1}^{T} e_i(k)' Γ_i(k) e_i(k) ],     (10)

where J_i^min is the loss attainable under perfect communication (component (a)), Γ_i(k) = L_i(k)' ( B_i' S_i(k+1) B_i + R_i ) L_i(k), and e_i(k) = x̂_i^s(k) - x̂_i^c(k). Note that e_i(k) is the communication error in subsystem i. Recall that there are N subsystems and M communication channels. If M = N, then e_i(k) = 0 for all i and k.
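The backward gain recursion above is easy to sanity-check in the scalar case. The sketch below implements the Riccati recursion for the finite-horizon LQR gains; scalar coefficients stand in for the matrices, and the function name is our own.

```python
def lqr_gains(A, B, Q, R, Qf, T):
    """Backward Riccati recursion for finite-horizon LQR gains L(0..T-1),
    scalar case: the control law is u(k) = -L(k) * xhat_c(k)."""
    S = Qf                                   # terminal condition S(T) = Qf
    gains = [0.0] * T
    for k in range(T - 1, -1, -1):
        L = (B * S * A) / (B * S * B + R)    # gain at stage k
        gains[k] = L
        S = Q + A * S * A - A * S * B * L    # cost-to-go S(k)
    return gains
```

For A = B = Q = R = Qf = 1, the gains converge (going backward from the horizon) to the stationary value 1/φ ≈ 0.618, where φ is the golden ratio, which is a handy check on the recursion.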
Component II: Control-aware scheduling. The main aim of the scheduling algorithm is to help minimize J of (3). To this end, one must minimize

E[ Σ_{k=1}^{T} e_i(k)' Γ_i(k) e_i(k) ]     (11)

of (10) for every i. Note that T in (11) is the control horizon. At any time k, the scheduler decides which M among the N subsystems may communicate. Note that e_i(k) = 0 when a communication channel is assigned to subsystem i at time k.
In the following section, we present a deep reinforcement learning algorithm for control-aware scheduling called DeepCAS. DeepCAS communicates only with the smart sensors. At every time instant, sensors are told if they can transmit to their associated controllers. Then, the sensors provide feedback on the scheduling decision for that stage. Note that we do not consider the overhead involved in providing feedback.
III Deep reinforcement learning for control-aware scheduling
As stated earlier, at the heart of DeepCAS lies the DQN. The DQN is a modern variant of Q-learning that effectively counters Bellman's curse of dimensionality. Essentially, DQN or Q-learning finds a solution to an associated Markov decision process (MDP) in an iterative, model-free manner. Before proceeding, let us recall the definition of an MDP. For a more detailed exposition, the reader is referred to [10]. An MDP is given by the tuple (S, A, P, r, γ), where
S is the state space of the MDP.
A is the set of actions that can be taken.
P is the transition probability, i.e., P(s' | s, a) is the probability of transitioning to state s' when action a is taken in state s.
r is the one-stage reward function, i.e., r(s, a) is the reward when action a is taken in state s.
γ is the discount factor, with 0 < γ ≤ 1.
Below, we state the MDP associated with our problem.
The state space consists of all possible augmented error vectors. Hence, the state vector at time k is given by s(k) = ( e_1(k), …, e_N(k) ).
The action space A consists of all ways of assigning the M channels to M of the N subsystems. Hence, the cardinality (size), |A|, of the action space equals (N choose M).
At time k, the reward r(k) is given by r(k) = - Σ_{i ∉ a(k)} ||e_i(k)||², where a(k) denotes the set of subsystems that transmit at time k.
Although it would seem natural to use the weighted error e_i(k)' Γ_i(k) e_i(k) of (11), we use the squared norm ||e_i(k)||² since it hastens the rate of convergence.
Note that the scheduler (DeepCAS) takes its action just before time k and receives its reward just after time k, based on the transmissions at time k. Also, note that DeepCAS only gets non-zero rewards from non-transmitting sensors. DeepCAS is model-free; hence, it does not need to know the transition probabilities.
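The action space and the reward described above can be sketched in a few lines. This is an illustrative construction under the stated definitions: actions are subsets of M transmitting subsystems, and the reward sums the squared errors of the subsystems that did not transmit. The helper names are ours.

```python
from itertools import combinations
from math import comb

def action_space(n, m):
    """Each action picks which m of the n subsystems may transmit."""
    return list(combinations(range(n), m))

def reward(errors, scheduled):
    """Negative squared error, summed over NON-transmitting subsystems;
    scheduled subsystems contribute zero (their error is reset)."""
    return -sum(e * e for i, e in enumerate(errors) if i not in scheduled)

acts = action_space(6, 3)          # Experiment 2 sizes: N = 6, M = 3
assert len(acts) == comb(6, 3)     # |A| = C(N, M) = 20
```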
Let us suppose we use a reinforcement learning algorithm, such as Q-learning, to solve this MDP. Since the learning algorithm finds policies that maximize the expected cumulative reward, we expect to find policies that minimize the scheduling effects on the entire system. This is a consequence of our above definition of the reward r(k). In the following, we provide a brief overview of Q-learning and DQN, the reinforcement learning algorithm at the heart of DeepCAS. Simply put, DeepCAS is a DQN solving the above-defined MDP.
DeepCAS. At any time t, the scheduler (DeepCAS) is interested in maximizing the following expected discounted future reward:

R(t) = E[ Σ_{k=t}^{T} γ^{k-t} r(k) ],

where r(k) is the single-stage reward defined above. Q-learning is a useful methodology to solve such problems. It is based on finding the following Q-factor for every state-action pair (s, a):

Q*(s, a) = max_π E[ R(t) | s(t) = s, a(t) = a, π ],

where π is a policy that maps states to actions. The algorithm itself is based on the Bellman equation:

Q*(s, a) = E[ r(s, a) + γ max_{a'} Q*(s', a') | s, a ].
Since our state space is continuous, we use a deep neural network (DNN) for function approximation. Specifically, we try to find good approximations of the Q-factors iteratively. In other words, the neural network takes a state s as input and outputs Q(s, a; θ) for every possible action a, such that Q(s, a; θ) ≈ Q*(s, a). This deep function approximator, with weights θ, is referred to as a Deep Q-Network. The Deep Q-Network is trained by minimizing a time-varying sequence of loss functions given by

L_t(θ_t) = E_{(s,a) ~ ρ} [ ( y_t - Q(s, a; θ_t) )² ],

where y_t = E[ r + γ max_{a'} Q(s', a'; θ_{t-1}) | s, a ] is the expected cost-to-go based on the latest update θ_{t-1} of the weights, and ρ is the behavior distribution. Training the neural network involves finding θ that minimizes these loss functions. Since the algorithm is online, training is done in conjunction with scheduling. At time t, after the feedback (reward) is received, one gradient descent step can be performed using the following gradient term:

∇_{θ_t} L_t(θ_t) = E[ ( y_t - Q(s, a; θ_t) ) ∇_{θ_t} Q(s, a; θ_t) ].     (12)
To make the algorithm implementable, we use a sample rather than the above expectation when updating the weights. At each time, we pick actions using the ε-greedy approach: we pick a random action with probability ε, and we pick a greedy action with probability 1 - ε. This constitutes the previously mentioned behavior distribution ρ, which governs how actions are picked. Note that a greedy action at time t is one that maximizes Q(s(t), a; θ). Initially, it is desirable to explore; hence, ε is set to 1. Once the algorithm has gained some experience, it is better to exploit this experience. To accomplish this, we attenuate ε over time.
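The ε-greedy rule with attenuation can be sketched as follows; the decay rate and floor are illustrative placeholders, not the values used in the paper.

```python
import random

def epsilon_greedy(q_values, eps, rng):
    """With probability eps pick a random action; otherwise pick a
    greedy (max-Q) action."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def anneal(eps, decay=0.995, floor=0.01):
    """Attenuate exploration toward a small floor (values are illustrative)."""
    return max(floor, eps * decay)
```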
Although we train our DNN in an online manner, we do not perform a gradient descent step directly using (12), since this can lead to poor learning. Instead, we store previous experiences (s(t), a(t), r(t), s(t+1)) in an experience replay memory D of fixed size. In other words, at time t, DeepCAS stores (s(t), a(t), r(t), s(t+1)) into D. When it comes to training the neural network at time t, it performs a single mini-batch gradient descent step. The mini-batch (of gradients) is randomly sampled from the aforementioned experience replay memory D. The idea of using an experience replay memory, to overcome biases and to have a stabilizing effect on the algorithm, was introduced in [8].
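The replay mechanism and the Bellman target computation can be sketched independently of the network itself. This is a minimal, stdlib-only sketch; capacity, seeding, and naming are our own choices.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (s, a, r, s_next) experiences; sampling
    random mini-batches de-correlates consecutive updates."""
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # old experiences fall off
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer),
                               min(batch_size, len(self.buffer)))

def td_target(r, q_next, gamma):
    """Bellman target y = r + gamma * max_a' Q(s', a')."""
    return r + gamma * max(q_next)
```

Each sampled tuple would feed one term of the mini-batch gradient in (12), with `td_target` supplying y_t.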
IV Numerical results
We are now ready to present our numerical results. Recall that a DQN is at the heart of DeepCAS, which uses a deep neural network to approximate the Q-factors. Before proceeding, we specify the algorithm parameters. The input to the neural network is the appended error vector. The hidden layer consists of 1024 rectifier units. The output layer is a fully connected linear layer with a single output for each action. The discount factor in our Q-learning algorithm is . The size of the experience replay buffer is fixed at . The exploration parameter ε is initialized to , then attenuated to at the rate of . For training the neural network, we use the ADAM optimizer [11] with a learning rate of and a decay of . Note that we used the same set of parameters for all of the experiments presented below.
Experiment 1: N = 3, M = 1. For our first experiment, we used DeepCAS to schedule one channel for a system with three subsystems, over a control horizon of length T. We considered three second-order single-input-single-output (SISO) subsystems: one stable (subsystem 2) and two unstable (subsystems 1 and 3). If there were three channels, then there would be no scheduling problem, and the total control loss would equal its minimum value. Since there is a single channel, one expects a solution to the scheduling problem to allocate it to subsystems 1 and 3 for a more substantial fraction of the time than to subsystem 2. This expectation is fair since subsystems 1 and 3 are unstable while subsystem 2 is stable. Once trained, DeepCAS indeed allocates the channel to subsystem 1 for 52% of the time, to subsystem 2 for 12% of the time, and to subsystem 3 for 36% of the time.
We train DeepCAS continuously over many epochs. Each epoch corresponds to a single run of the control problem with horizon T. At the start of each epoch, the initial conditions for the control problem are chosen as explained in § II. Fig. 2 illustrates the learning progress of DeepCAS. The abscissa of the graph represents the epoch number, while the ordinate represents the average control loss. The plot shows the mean over Monte Carlo runs. Since the DQN is randomly initialized, scheduling decisions are poor at the beginning, and the average control loss is high. As learning proceeds, the decisions taken improve. Within relatively few epochs, DeepCAS converges to a low-loss scheduling strategy.
Comparison to periodic scheduling. Traditionally, the problem of scheduling for control systems is solved by using control-theoretic heuristics to find periodic schedules. To compare the performance of DeepCAS with such techniques, we do the following. For every fixed period length p, we perform an exhaustive search (among the 3^p candidate sequences) to find a scheduling sequence of length p that minimizes the control loss. The results are listed in Table I. (We stopped the search at a small period length, since exhaustive search is extremely time-consuming and impractical beyond that.) As illustrated in Table I, the best periodic schedule found this way incurs a higher control loss than the scheduling strategy found by DeepCAS. We conclude that, in addition to being faster, DeepCAS does not need any system specification and can schedule efficiently for very long control horizons.
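The exhaustive baseline above can be sketched generically. The function below enumerates all n^p single-channel periodic schedules of a given period and keeps the best one; the loss evaluator is supplied by the caller (in practice it would simulate the closed-loop system), and the signature is our own illustrative choice.

```python
from itertools import product

def best_periodic_schedule(period, n_subsystems, loss_fn):
    """Enumerate all n^p single-channel schedules of length `period`
    and return the one whose (caller-supplied) control loss is smallest."""
    best, best_loss = None, float("inf")
    for seq in product(range(n_subsystems), repeat=period):
        loss = loss_fn(seq)
        if loss < best_loss:
            best, best_loss = seq, loss
    return best, best_loss
```

The n^p enumeration is precisely why this baseline stops at small periods, while DeepCAS sidesteps the enumeration entirely.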
Another approach to scheduling is to solve an associated mixed-integer quadratic program (MIQP); see [12] for details. Solving an MIQP accurately is computationally infeasible for all but small system sizes.
Experiment 2: N = 6, M = 3. For our second experiment, we train DeepCAS to schedule three channels for a system with six second-order SISO subsystems. If M = N, the total control loss would equal its minimum value. As before, learning is done continuously over many epochs. Fig. 3 illustrates the learning progress of DeepCAS in scheduling three channels among six subsystems. The abscissa and ordinate axes are as before. As evidenced in the figure, DeepCAS quickly finds schedules with an associated control loss of around 20. Owing to the difference in scale, the control loss in Fig. 3 appears to have a higher variance than that in Fig. 2.
We are unable to compare the results of Experiment 2 with any traditional heuristics, since such heuristics do not extend to the system size and control horizon considered here. Further, performing an exhaustive search for periodic schedules is not possible, since the number of candidate sequences grows exponentially with the period length p.
Scheduling in general control settings. The systems considered hitherto consist of independent subsystems. This facilitates splitting the total control cost into two components; see (10). The one-stage reward in our algorithm is the negative of the error due to lack of communication, as defined via (11). In general multi-agent settings, however, this splitting may not be possible.
Now, we discuss an extension of our algorithm that allows scheduling in these general scenarios. For such an extension, observe that one merely needs a recipe for obtaining the state space and reward in the definition of the associated MDP. Since we want our scheduler to reduce the control loss, a natural choice for the reward is the negative of the one-stage control cost. Regarding the state space, one may need to consider the state (or state-estimate) and control-action pairs, say (x(k), u(k)), of the system; it may also be necessary to append the pairs of all the subsystems. The expected control cost associated with taking an action in some state does not change with time, which allows for a consistent definition of the MDP. It may be noted that there may be other ways to define an associated MDP.
To show that the above extension is viable, we repeated Experiments 1 and 2 with the negative of the one-stage control cost as the reward. The results of the modified experiments are very similar to the original ones. Due to space constraints, we present only Fig. 4, associated with the modification of Experiment 1.
This paper considered the problem of scheduling the sensor-to-controller communication in a networked control system, consisting of a multitude of independent control subsystems, in the presence of scarce communication resources. To this end, we developed DeepCAS, a deep reinforcement learning-based control-aware scheduling algorithm. This algorithm is model-free and scalable, and it outperforms scheduling heuristics tailored for feedback control applications; specifically, we compared our algorithm with periodic schedules. We also briefly discussed how DeepCAS can be extended to networked control systems containing coupled subsystems.
As stated earlier, this paper does not address the overhead involved in providing feedback. An exciting future direction is the development of sophisticated reinforcement learning algorithms that reduce this overhead. Addressing these issues may lead to the associated problem of large discrete action spaces, which the resulting algorithms would need to take into account. Lastly, it would also be interesting to consider imperfect communication scenarios.
[1] P. Park, S. C. Ergen, C. Fischione, C. Lu, and K. H. Johansson, "Wireless network design for control systems: a survey," IEEE Communications Surveys & Tutorials, 2018.
[2] H. Rehbinder and M. Sanfridson, "Scheduling of a limited communication channel for optimal control," Automatica, vol. 40, no. 3, pp. 491–500, March 2004.
[3] D. Hristu-Varsakelis and L. Zhang, "LQG control of networked control systems," International Journal of Control, vol. 81, no. 8, pp. 1266–1280, 2008.
[4] L. Shi, P. Cheng, and J. Chen, "Optimal periodic sensor scheduling with limited resources," IEEE Transactions on Automatic Control, vol. 56, no. 9, pp. 2190–2195, 2011.
[5] L. Orihuela, A. Barreiro, F. Gómez-Estern, and F. R. Rubio, "Periodicity of Kalman-based scheduled filters," IEEE Transactions on Automatic Control, vol. 50, no. 10, pp. 2672–2676, 2014.
[6] L. Zhao, W. Zhang, J. Hu, A. Abate, and C. J. Tomlin, "On the optimal solutions of the infinite-horizon linear sensor scheduling problem," IEEE Transactions on Automatic Control, vol. 59, no. 10, pp. 2825–2830, 2014.
[7] Y. Mo, E. Garone, and B. Sinopoli, "On infinite-horizon sensor scheduling," Systems & Control Letters, vol. 67, pp. 65–70, May 2014.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," in NIPS Deep Learning Workshop, 2013.
[9] B. Demirel, A. S. Leong, V. Gupta, and D. E. Quevedo, "Trade-offs in stochastic event-triggered control," arXiv:1708.02756, 2017.
[10] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[11] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, 2015.
[12] T. Charalambous, A. Ozcelikkale, M. Zanon, P. Falcone, and H. Wymeersch, "On the resource allocation problem in wireless networked control systems," in Proceedings of the IEEE Conference on Decision and Control, 2017.