Networked Control of Nonlinear Systems under Partial Observation Using Continuous Deep Q-Learning
Abstract
In this paper, we propose a design of a model-free networked controller for a nonlinear plant whose mathematical model is unknown. In a networked control system, the controller and the plant are located away from each other and exchange data over a network, which causes network delays that may fluctuate randomly due to network routing. Thus, in this paper, we assume that the current network delay is not known but that the maximum value of the fluctuating network delays is known beforehand. Moreover, we also assume that the sensor cannot observe all state variables of the plant. Under these assumptions, we apply continuous deep Q-learning to the design of the networked controller. Then, we introduce an extended state consisting of a sequence of past control inputs and outputs as the input to the deep neural network. By simulation, it is shown that, using the extended state, the controller can learn a control policy robust to the fluctuation of the network delays under the partial observation.
I Introduction
Reinforcement learning (RL) is one of the theoretical frameworks in machine learning (ML) and a dynamic-programming-based learning approach to the search for an optimal control policy [1]. RL is useful for designing a controller for a plant whose mathematical model is unknown because RL is a model-free learning approach, that is, we can design the controller without the plant's model. In RL, a controller that is a learner interacts with the plant and updates its control policy. The main goal of RL is to learn the control policy that maximizes the long-term rewards. RL has been applied to various control problems [2]–[8].
Furthermore, to design a controller for a complicated system, we often use function approximations. It is known that policy gradient methods [9, 10] with function approximations are especially useful in control problems where the states of plants and the control inputs are continuous values. Recently, deep reinforcement learning (DRL) has been actively researched. In DRL, conventional RL algorithms are combined with deep neural networks (DNNs) as high-performance function approximators [11]–[14]. In [15], DRL is applied to a parking problem of a four-wheeled vehicle, which is a nonholonomic system. In [16], DRL is applied to controlling robot manipulators. Moreover, applications of DRL to networked control systems (NCSs) have also been proposed [17], [18], [20]. NCSs have attracted much attention thanks to the development of network technologies. In NCSs, the controller and the plant are located away from each other. The controller computes control inputs based on outputs observed by the sensor and sends them to the plant via a network. In [17], DRL is applied to control-aware scheduling of NCSs consisting of multiple subsystems operating over a shared network. In [18], DRL is applied to event-triggered control (ETC) [19]. However, in [17] and [18], transmission delays over the network are not considered. One of the problems of NCSs is that there are network delays in the exchange of data between the controller and the plant. In the case where the network delays are constant and their parameters are known, we can design the networked controller considering the network delays. Practically, however, it is difficult to identify the network delays beforehand. Moreover, the network delays may fluctuate due to network routing. Thus, in [20], we assumed that the sensor can observe all state variables of the plant and proposed the design of a networked controller with network delays using a DRL algorithm. In general, however, the sensor cannot always observe all of them.
In RL, the partial observation often degrades learning performances of the controllers.
In this paper, we consider the following networked control system:
- The plant is a nonlinear system whose mathematical model is unknown.
- Network delays fluctuate randomly due to network routing, but their maximum values are known beforehand.
- The sensor cannot observe all state variables of the plant.
Under the above assumptions, we propose a networked controller with a DNN using the continuous deep Q-learning algorithm [13]. Then, we introduce an extended state consisting of both past control inputs and outputs of the plant as the input to the DNN.
The paper is organized as follows. In Section II, we review continuous deep Qlearning. In Section III, we propose a networked controller using a DNN under the above three assumptions. In Section IV, by simulation, we apply the proposed learning algorithm to a networked controller for stabilizing a Chua circuit under the fluctuating network delays and the partial observation. In Section V, we conclude the paper.
II Preliminaries
This section reviews RL and continuous deep Q-learning, which is a DRL algorithm.
II-A Reinforcement Learning (RL)
The main goal of RL is for a controller to learn its optimal control policy by trial and error while interacting with a plant.
Let $\mathcal{X}$ and $\mathcal{U}$ be the sets of states and control inputs of the plant, respectively. The controller receives an immediate reward given by the reward function $r: \mathcal{X} \times \mathcal{U} \to \mathbb{R}$:

$r_{k+1} = r(x_k, u_k),$  (1)

where $x_k \in \mathcal{X}$ and $u_k \in \mathcal{U}$ are the state and the control input at discrete time $k$. In RL, it is necessary to evaluate the policy based on long-term rewards. Thus, the value function and the Q-function are defined as follows:

$V^{\pi}(x) = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^{i} r_{k+i+1} \,\middle|\, x_k = x\right],$  (2)

$Q^{\pi}(x, u) = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^{i} r_{k+i+1} \,\middle|\, x_k = x,\ u_k = u\right],$  (3)

where $\pi$ is the evaluated control policy, that is, the input at state $x_k$ is determined by $u_k = \pi(x_k)$, and $\gamma \in [0, 1)$ is the discount factor that prevents the divergence of the long-term rewards.

In the Q-learning algorithm, the controller indirectly learns the following greedy deterministic policy through updating the Q-function:

$\mu(x) = \arg\max_{u \in \mathcal{U}} Q(x, u).$  (4)
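The long-term reward in Eqs. (2) and (3) can be estimated from a finite trajectory of immediate rewards. The following minimal sketch (our own illustration, not the paper's code) computes such a discounted return:

```python
# Minimal sketch: finite-horizon estimate of the discounted long-term reward
# sum_i gamma^i * r_{k+i+1} from a recorded reward sequence.
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With $\gamma < 1$, the contribution of distant rewards decays geometrically, which is what keeps the infinite sums in Eqs. (2) and (3) finite.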
II-B Continuous Deep Q-Learning with Normalized Advantage Function
To implement the Q-learning algorithm for plants whose states and control inputs are continuous, Gu et al. proposed a parameterized quadratic function $A(x, u \mid \theta)$, called a normalized advantage function (NAF) [13], satisfying the following equations, where $\theta$ is the parameter vector of the DNN:

$Q(x, u \mid \theta) = V(x \mid \theta) + A(x, u \mid \theta),$  (5)

$A(x, u \mid \theta) = -\frac{1}{2}\left(u - \mu(x \mid \theta)\right)^{\top} P(x \mid \theta) \left(u - \mu(x \mid \theta)\right),$  (6)

where $P(x \mid \theta)$ is a positive definite matrix, and $V(x \mid \theta)$ and $Q(x, u \mid \theta)$ are approximators of Eqs. (2) and (3), respectively. The network output $\mu(x \mid \theta)$ gives the optimal control input directly instead of Eq. (4), that is, the control input that maximizes the approximated Q-function. In other words, the approximated Q-function is divided into an action-dependent term and an action-independent term, and the action-dependent term is expressed as a quadratic function of the action. From Eqs. (5) and (6), when $u = \mu(x \mid \theta)$, the Q-function is maximized with respect to the action and we have

$\max_{u} Q(x, u \mid \theta) = V(x \mid \theta).$  (7)
We show an illustration of a DNN for continuous deep Q-learning in Fig. 1. The outputs of the DNN consist of the approximated value $V(x \mid \theta)$, the optimal control input $\mu(x \mid \theta)$, and the entries that constitute the lower triangular matrix $L(x \mid \theta)$, where the diagonal entries are exponentiated. Moreover, the positive definite matrix is given by $P(x \mid \theta) = L(x \mid \theta) L(x \mid \theta)^{\top}$.
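The construction of the positive definite matrix from the network's lower-triangular outputs can be sketched as follows; the function name and the flat input layout are our own assumptions, not the paper's code:

```python
import numpy as np

def build_P(l_entries, n):
    """Assemble P = L L^T from the n(n+1)/2 lower-triangular network outputs."""
    L = np.zeros((n, n))
    L[np.tril_indices(n)] = l_entries
    # Exponentiate the diagonal so it is strictly positive, which makes
    # P = L L^T symmetric positive definite.
    diag = np.diag_indices(n)
    L[diag] = np.exp(L[diag])
    return L @ L.T
```

Any real-valued network output then maps to a valid positive definite $P$, so the quadratic advantage term is always concave in the action.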
III Continuous Deep Q-Learning-Based Networked Control
III-A Networked Control System
We consider the networked control of the following nonlinear plant, as shown in Fig. 2:

$\dot{x}(t) = f(x(t), u(t)),$  (8)

$y_k = g(x(kT)),$  (9)

where
- $x(t)$ is the state of the plant at time $t$,
- $u(t)$ is the control input of the plant at time $t$, and the $k$th updated control input computed by the digital controller is denoted by $u_k$,
- $y_k$ is the $k$th output of the plant observed by the sensor,
- $T$ is the sampling period of the sensor,
- $f$ describes the mathematical model of the plant but is assumed to be unknown, and
- $g$ is the output function of the plant that is characterized by the sensor.
The discrete-time control input is sent to the D/A converter and held until the next input is received. The plant and the digital controller are connected by information networks, and there are two types of network delays: one is caused by the transmission of the observed outputs from the sensor to the controller, while the other is caused by the transmission of the updated control inputs from the digital controller to the plant. The former and the latter delay at discrete time $k$ are denoted by $\tau_k^{sc}$ and $\tau_k^{ca}$, respectively. Then, for $t \in [\,kT + \tau_k^{sc} + \tau_k^{ca},\ (k+1)T + \tau_{k+1}^{sc} + \tau_{k+1}^{ca}\,)$,

$u(t) = u_k.$  (10)
We assume that packet losses do not occur in the networks and that all data are received in the same order as they are sent. We also assume that both network delays are upper bounded and that their maximum values are known beforehand.
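The timing relations implied by the two delay types can be illustrated with a short sketch; the function and variable names (`T`, `tau_sc`, `tau_ca`) are ours, not the paper's:

```python
# The k-th output is sensed at k*T, reaches the controller at k*T + tau_sc[k],
# and the resulting control input reaches the plant at
# k*T + tau_sc[k] + tau_ca[k].
def input_arrival_times(T, tau_sc, tau_ca):
    return [k * T + s + c for k, (s, c) in enumerate(zip(tau_sc, tau_ca))]

times = input_arrival_times(0.1, [0.02, 0.03], [0.01, 0.02])
print(times)  # approximately [0.03, 0.15]
```

Since the delays fluctuate per sample, the interval during which each $u_k$ is held by the D/A converter also fluctuates, which is exactly the uncertainty the controller must learn to compensate for.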
III-B Extended State
We assume that the maximum values of the sensor-to-controller delay $\tau_k^{sc}$ and the controller-to-plant delay $\tau_k^{ca}$ are known, and denote them by $\bar{\tau}^{sc}$ and $\bar{\tau}^{ca}$, respectively ($\tau_k^{sc} \leq \bar{\tau}^{sc}$ and $\tau_k^{ca} \leq \bar{\tau}^{ca}$ for all $k$).

First, we consider the randomly fluctuating delays. The sensor observes the $k$th output at time $kT$. The controller receives the $k$th output and computes the $k$th control input at $kT + \tau_k^{sc}$. The $k$th control input is applied to the plant at $kT + \tau_k^{sc} + \tau_k^{ca}$. Then, the controller must estimate the state from delayed information when computing $u_k$. Since the total delay is bounded by $\bar{\tau}^{sc} + \bar{\tau}^{ca}$, we use the last $d_u$ control inputs, the number that the controller needs in the worst case to estimate the state of the plant, as shown in Fig. 3. Thus, in [20], we introduced an extended state and used it as the input of a DNN. We also use the past control input sequence in this paper. Here, however, the sensor cannot observe all state variables.

Second, we consider the partial observation. In [21], Aangenent et al. proposed a data-based optimal control method using past control input and output sequences in the case where the plant is a linear, controllable, and observable system. There, the length of the sequences must be larger than the observability index of the plant. Similarly, in this paper, we also use the past control input and output sequences to estimate the state, as shown in Fig. 4, where the length $d_y$ of the output sequence is a hyperparameter selected beforehand. Although there is no theoretical guarantee, we select $d_y$ conservatively such that it is larger than the dimension of the plant's state. We define the following extended state $z_k$.
$z_k = \left[\, y_k^{\top}\ \ y_{k-1}^{\top}\ \cdots\ y_{k-d_y+1}^{\top}\ \ u_{k-1}^{\top}\ \cdots\ u_{k-d_u}^{\top} \,\right]^{\top},$  (11)

where $d_y$ and $d_u$ are the lengths of the past output and control input sequences, respectively.
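Maintaining the extended state amounts to keeping fixed-length buffers of the most recent outputs and control inputs. A minimal sketch follows, with class and method names of our own choosing rather than the paper's:

```python
from collections import deque

class ExtendedState:
    """Fixed-length buffers of the d_y latest outputs and d_u latest inputs."""
    def __init__(self, d_u, d_y, u_dim, y_dim):
        # Zero-initialized histories; deques drop the oldest entry automatically.
        self.us = deque([[0.0] * u_dim for _ in range(d_u)], maxlen=d_u)
        self.ys = deque([[0.0] * y_dim for _ in range(d_y)], maxlen=d_y)

    def update(self, u_prev, y_new):
        """Push the last sent control input and the newest observed output."""
        self.us.append(list(u_prev))
        self.ys.append(list(y_new))

    def vector(self):
        """Flat vector fed to the DNN: past outputs followed by past inputs."""
        flat = [v for y in self.ys for v in y]
        flat += [v for u in self.us for v in u]
        return flat
```

The DNN input dimension is then $d_y \cdot \dim(y) + d_u \cdot \dim(u)$, fixed regardless of the realized delays.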
Thus, we design the networked controller with a DNN as shown in Fig. 5.
III-C DRL Algorithm
The parameter vector $\theta$ of the DNN for the controller is optimized by the continuous deep Q-learning algorithm [13]. The input to the DNN is the extended state $z_k$. The proposed learning algorithm is shown in Algorithm 1. In the same way as the DQN algorithm [11], we use experience replay and a target network. The parameter vectors of the main network and the target network are denoted by $\theta$ and $\theta^{-}$, respectively. For the update of $\theta$, the following TD error is used:

$t_k = r_{k+1} + \gamma V(z_{k+1} \mid \theta^{-}),$  (12)

$L(\theta) = \left( t_k - Q(z_k, u_k \mid \theta) \right)^2,$  (13)
Moreover, for the update of $\theta^{-}$, the following soft update is used:

$\theta^{-} \leftarrow (1 - \eta)\,\theta^{-} + \eta\,\theta,$  (14)

where $\eta$ is a given very small positive real number.
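The TD target and the soft target-network update can be sketched as follows, with parameter vectors represented as plain lists; the function names are ours, not the paper's:

```python
def td_target(r, gamma, v_next):
    """Target value: immediate reward plus discounted target-network value."""
    return r + gamma * v_next

def soft_update(theta_target, theta_main, eta=0.001):
    """theta_target <- (1 - eta) * theta_target + eta * theta_main."""
    return [(1.0 - eta) * t + eta * m for t, m in zip(theta_target, theta_main)]

print(td_target(1.0, 0.99, 2.0))                      # approximately 2.98
print(soft_update([0.0, 1.0], [1.0, 1.0], eta=0.5))   # [0.5, 1.0]
```

Because $\eta$ is very small, the target network trails the main network slowly, which stabilizes the bootstrapped target in Eq. (12).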
The exploration noises are generated by a given random process.
IV Simulation
We apply the proposed controller to the stabilization of a Chua circuit as follows. The sampling period and the terminal time of a learning episode are fixed beforehand.
IV-A Chua Circuit
The dynamics of a Chua circuit are given by

$\dot{x}_1 = \alpha\left(x_2 - x_1 - f(x_1)\right), \quad \dot{x}_2 = x_1 - x_2 + x_3, \quad \dot{x}_3 = -\beta x_2,$  (15)

where $f(x_1) = m_1 x_1 + \frac{1}{2}(m_0 - m_1)\left(|x_1 + 1| - |x_1 - 1|\right)$ is the piecewise-linear nonlinearity. In this simulation, the circuit parameters are fixed, but they are unknown to the controller. The Chua circuit has a chaotic attractor and a limit cycle as shown in Fig. 6. We assume that the states are sensed as follows.

$y_k = \left[\, x_1(kT)\ \ x_2(kT) \,\right]^{\top}.$  (16)
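To illustrate the dynamics involved, the following is a short simulation sketch assuming the classical dimensionless piecewise-linear Chua equations; the parameter values below are illustrative only, since the paper treats the true parameters as unknown to the controller:

```python
def chua_rhs(state, alpha=10.0, beta=15.0, m0=-1.2, m1=-0.7):
    """Right-hand side of the classical piecewise-linear Chua equations.
    Parameter values are illustrative, not the paper's."""
    x1, x2, x3 = state
    f = m1 * x1 + 0.5 * (m0 - m1) * (abs(x1 + 1.0) - abs(x1 - 1.0))
    return [alpha * (x2 - x1 - f), x1 - x2 + x3, -beta * x2]

def simulate(state, dt=1e-3, steps=100):
    """Forward-Euler integration over a short horizon."""
    s = list(state)
    for _ in range(steps):
        d = chua_rhs(s)
        s = [si + dt * di for si, di in zip(s, d)]
    return s
```

In a networked simulation, the continuous dynamics would be integrated like this between sampling instants, with the held (delayed) control input applied over each interval.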
We assume that the equilibrium points of the Chua circuit are unknown because the controller does not know the circuit parameters. Thus, we define the reward function in this simulation as follows.
First, we define a reward term based on the observed outputs and the control input.
(17)
Second, we define a reward term based on the past output sequence used for the extended state.
(18)
Third, we define a reward term based on the past control input sequence used for the extended state.
(19)
Finally, we define the immediate reward at discrete time $k$ as follows.
(20)
Thus, the goal of DRL is the stabilization of the circuit at one of its equilibrium points.
IV-B Design of the Controller
We use a DNN with four hidden layers, where all hidden layers have 128 units and all layers are fully connected. The activation functions are ReLU except for the output layer. Regarding the activation functions of the output layer, we use a linear function for the value unit and for the units that parameterize the advantage function, while we use a weighted hyperbolic tangent function for the control input unit. The minibatch size is 128, and the size of the replay memory is fixed beforehand. The parameters of the DNN are updated 10 times per 4 discrete time steps by Adam [22] with a fixed learning step size. The soft update rate for the target network is 0.001, and the discount rate for the Q-value is 0.99.
For the exploration noise process, we use an Ornstein-Uhlenbeck process [23]. The exploration noises are multiplied by 3.5 from the 1st to the 1000th episode, and the noise is gradually reduced after the 1001st episode. The initial state is randomly selected within fixed ranges for each episode.
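An Ornstein-Uhlenbeck exploration process can be sketched in a few lines; the parameter values below are common defaults in DRL practice, not the paper's settings:

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck process:
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""
    def __init__(self, theta=0.15, mu=0.0, sigma=0.2, dt=1.0, seed=None):
        self.theta, self.mu, self.sigma, self.dt = theta, mu, sigma, dt
        self.x = mu
        self.rng = random.Random(seed)

    def sample(self):
        self.x += (self.theta * (self.mu - self.x) * self.dt
                   + self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0))
        return self.x
```

Unlike independent Gaussian noise, consecutive samples are correlated and revert toward the mean, which yields smoother exploration signals for continuous control inputs.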
IV-C Results
First, we assume that, for all $k$, the network delays fluctuate randomly within fixed ranges. These ranges are unknown to the controller; however, their upper bounds are known beforehand and are used to set the length of the past control input sequence. Moreover, we select the length $d_y$ of the past output sequence to be larger than the dimension of the plant's state. The learning curve is shown in Fig. 7, where the values on the vertical axis, called rewards, are given by the sum of the immediate rewards between the 50th sampling instant and the episode's terminal for each episode. It is shown that the controller can learn a control policy that achieves a high reward. Moreover, Figs. 8 and 9 show the time responses of the Chua circuit using the control policy after 8500 episodes. It is shown that a controller that has sufficiently learned the control policy using the proposed method can stabilize the circuit.
V Conclusion
In this paper, we proposed a model-free networked controller for a nonlinear plant with network delays using continuous deep Q-learning. Moreover, we assumed that the sensor cannot observe all state variables of the plant. Thus, we introduced an extended state consisting of a sequence of past control inputs and outputs and used it as an input to a DNN. We showed the usefulness of the proposed controller by stabilizing a Chua circuit. Future work includes extending the proposed controller to the cases with packet losses and sensing noises.
References
 [1] R. S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction," A Bradford Book, MIT Press, Cambridge, Massachusetts, 1998.
 [2] F. L. Lewis and D. Vrabie, "Reinforcement Learning and Adaptive Dynamic Programming for Feedback Control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
 [3] F. L. Lewis and D. Liu, "Reinforcement Learning and Approximate Dynamic Programming for Feedback Control," IEEE Press, 2013.
 [4] F. L. Lewis and K. G. Vamvoudakis, "Reinforcement Learning for Partially Observable Dynamic Processes: Adaptive Dynamic Programming Using Measured Output Data," IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 1, pp. 14–25, 2011.
 [5] T. Fujita and T. Ushio, "RL-Based Optimal Networked Control Considering Network Delay of Discrete-Time Linear Systems," in Proc. of 2015 IEEE ECC, pp. 2481–2486, 2015.
 [6] T. Fujita and T. Ushio, "Optimal Digital Control with Uncertain Network Delay of Linear Systems Using Reinforcement Learning," IEICE Trans. Fundamentals, vol. E99-A, no. 2, pp. 454–461, 2016.
 [7] E. M. Wolff, U. Topcu, and R. M. Murray, "Robust Control of Uncertain Markov Decision Processes with Temporal Logic Specifications," in Proc. of 2012 IEEE CDC, pp. 3372–3379, 2012.
 [8] M. Hiromoto and T. Ushio, "Learning an Optimal Control Policy for a Markov Decision Process Under Linear Temporal Logic Specifications," in Proc. of IEEE Symposium Series on Computational Intelligence, pp. 548–555, 2015.
 [9] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy Gradient Methods for Reinforcement Learning with Function Approximation," in Proc. of the 12th NIPS, pp. 1057–1063, 1999.
 [10] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic Policy Gradient Algorithms," in Proc. of the 31st ICML, vol. 32, pp. 387–395, 2014.
 [11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-Level Control through Deep Reinforcement Learning," Nature, vol. 518, pp. 529–533, 2015.
 [12] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous Control with Deep Reinforcement Learning," arXiv preprint arXiv:1509.02971, 2016.
 [13] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous Deep Q-Learning with Model-Based Acceleration," in Proc. of the 33rd ICML, pp. 2829–2838, 2016.
 [14] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning," in Proc. of the 33rd ICML, pp. 1928–1937, 2016.
 [15] N. Masuda and T. Ushio, "Control of Nonholonomic Vehicle System Using Hierarchical Deep Reinforcement Learning," in Proc. NOLTA2017, pp. 26–29, 2017.
 [16] B. Sangiovanni, G. P. Incremona, A. Ferrara, and M. Piastra, "Deep Reinforcement Learning Based Self-Configuring Integral Sliding Mode Control Scheme for Robot Manipulators," in Proc. of 2018 IEEE CDC, pp. 5969–5974, 2018.
 [17] B. Demirel, A. Ramaswamy, D. E. Quevedo, and H. Karl, "DeepCAS: A Deep Reinforcement Learning Algorithm for Control-Aware Scheduling," IEEE Control Systems Letters, vol. 2, no. 4, pp. 737–742, 2018.
 [18] D. Baumann, J.-J. Zhu, G. Martius, and S. Trimpe, "Deep Reinforcement Learning for Event-Triggered Control," in Proc. of 2018 IEEE CDC, pp. 943–950, 2018.
 [19] W. P. M. H. Heemels, K. H. Johansson, and P. Tabuada, "An Introduction to Event-Triggered and Self-Triggered Control," in Proc. of 2012 IEEE CDC, pp. 3270–3285, 2012.
 [20] J. Ikemoto and T. Ushio, "Application of Continuous Deep Q-Learning to Networked State-Feedback Control of Nonlinear Systems with Uncertain Network Delays," in Proc. NOLTA2019, 2019.
 [21] W. Aangenent, D. Kostic, B. de Jager, R. van de Molengraft, and M. Steinbuch, "Data-Based Optimal Control," in Proc. of 2005 IEEE ACC, pp. 1460–1465, 2005.
 [22] D. P. Kingma and J. L. Ba, "Adam: A Method for Stochastic Optimization," arXiv preprint arXiv:1412.6980, 2014.
 [23] G. E. Uhlenbeck and L. S. Ornstein, "On the Theory of the Brownian Motion," Physical Review, vol. 36, no. 5, pp. 823–841, 1930.