Complete stability analysis of a heuristic ADP control design

# Complete stability analysis of a heuristic approximate dynamic programming control design

Yury Sokolov Department of Mathematical Sciences, The University of Memphis, United States Robert Kozma Department of Mathematical Sciences, The University of Memphis, United States Ludmilla D. Werbos IntControl LLC, Arlington, and CLION, The University of Memphis, USA  and  Paul J. Werbos CLION and the National Science Foundation (NSF), United States
###### Abstract.

This paper provides new stability results for Action-Dependent Heuristic Dynamic Programming (ADHDP), using a control algorithm that iteratively improves an internal model of the external world in the autonomous system based on its continuous interaction with the environment. We extend previous results for ADHDP control to the case of general multi-layer neural networks with deep learning across all layers. In particular, we show that the introduced control approach is uniformly ultimately bounded (UUB) under specific conditions on the learning rates, without explicit constraints on the temporal discount factor. We demonstrate the benefit of our results to the control of linear and nonlinear systems, including the cart-pole balancing problem. Our results show significantly improved learning and control performance as compared to the state-of-art.

###### Key words and phrases:

Yury Sokolov]ysokolov@memphis.edu Robert Kozma]rkozma@memphis.edu Ludmilla D. Werbos]l.dalmat@gmail.edu Paul J. Werbos]werbos@ieee.org

## 1. Introduction

Adaptive Dynamic Programming (ADP) addresses the general challenge of optimal decision and control for sequential decision making problems in real-life scenarios with complex and often uncertain, stochastic conditions without the presumption of linearity. ADP is a relatively young branch of mathematics; the pioneering work (Werbos, 1974) provided powerful motivation for extensive investigations of ADP designs in recent decades (Barto, Sutton Anderson, 1983; Werbos, 1992; Bertsekas Tsitsiklis, 1996; Si, Barto Powell Wunsch, 2004; Vrabie Lewis, 2009; Lendaris, 2009; Wang, Liu, Wei Zhao Jin, 2012; Zhang, Liu, Luo Wang, 2013). ADP has not only shown solid theoretical results to optimal control but also successful applications (Venayagamoorthy Harley Wunsch, 2003). Various ADP designs demonstrated powerful results in solving complicated real-life problems, involving multi-agent systems and games (Valenti, 2007; Al-Tamini Lewis Abu-Khalaf, 2007; Zhang Wei Liu, 2011).

The basic ADP approaches include heuristic dynamic programming (HDP), dual heuristic dynamic programming (DHP) and globalized DHP (GDHP) (Werbos, 1974, 1990; White Sofge, 1992; Prokhorov Wunsch, 1997). For each of these approaches there exists an action-dependent (AD) variation (White Sofge, 1992). For several important cases, the existence of stable solution for ADP control has been shown under certain conditions (Abu-Khalaf Lewis, 2005; Vrabie Lewis, 2009; Lewis Liu, 2012; Zhang, Zhang, Luo Liang, 2013).

The stability of ADP in the general case is an open and yet unsolved problem. There are significant efforts to develop conditions for stability in various ADP designs. We solved the stability problem for the specific ADHDP control case using the Lyapunov approach, which is a classical method of investigating stability of dynamical processes. Here we are addressing a discrete time dynamical system, where the dynamics is described by a difference equation. The discrete time Lyapunov function is used to prove the stability of the controlled process under certain conditions. In this paper we generalize the results of (Liu, Sun, Si, Guo Mei, 2012) for deriving stability conditions for ADHDP with traditional three layer Multi-Layer Perceptron (MLP). The work (Liu et al., 2012) derives a stability condition for the system with weights adapted between the hidden and output layers only, under the assumption that networks have large enough number of neurons in the hidden layers.

The approach presented in (Liu et al., 2012), in effect, is equivalent to a linear basis function approach: it is easy but it leads to scalability problems. The complexity of the system is growing exponentially for the required degree of approximation of a function of given smoothness (Barron, 1994). Additional problems arise regarding the accuracy of parameter estimation, which tends to grow with the number of parameters, all other factors are kept the same. If we have too many parameters for a limited set of data, it leads to overtraining. We need more parsimonious model, capable of generalization, hence our intention is to use fewer parameters in truly nonlinear networks, which is made possible by implementing more advanced learning algorithm. In the present work we focus on studying the stability properties of the ADP system with MLP-based critic, when the weights are adapted between all layers. By using Lyapunov approach, we study the uniformly ultimately bounded property of the ADHDP design. Preliminary results of our generalized stability studies have been reported in (Kozma Sokolov, 2013), where we showed that our general approach produced improved learning and convergence results, especially in the case of difficult control problems.

The rest of the paper is organized as follows. First we briefly outline theoretical foundations of ADHDP. Next we describe the learning algorithm based on gradient descent in the critic and action networks. This is followed by the statements and the proofs of our main results on the generalized stability criteria of the ADP approach. Finally, we illustrate the results using examples of two systems. The first one is a simple linear system used in (Liu et al., 2012), and the second example is the inverted pendulum system, similar to (He, 2011). We conclude the paper by outlining potential benefits of our general results for future applications in efficient real-time training and control.

## 2. Theoretical foundations of ADHDP control

### 2.1. Basic definitions

Let us consider a dynamical system (plant) with discrete dynamics, which is described by the following nonlinear difference equation:

 x(t+1)=f(x(t),u(t)), (2.1)

where is the -dimensional plant state vector and is the -dimensional control (or action) vector.

Previously we reported some stability results for ADP in the general stochastic case (Werbos, 2012). In this paper we focus on the deterministic case, as described in equation (2.1) and introduce action-dependent heuristic dynamic programming (ADHDP) to control this system. The original ADHDP method has been used in the 1990’s for various important applications, including the manufacturing of carbon-carbon composite parts (White Sofge, 1992). ADHDP is a learning algorithm for adapting a system made up of two components, the critic and the action, as shown in Fig. 1. These two major components can be implemented using any kind of differentiable function approximator. Probably the most widely used value function approximators in practical applications (as surveyed in Lewis and Liu, 2012) are neural networks, linear basis function approximators, and piecewise linear value functions such as those used by (Powell, 2011). In this work we use MLP as the universal function approximator.

The optimal value function, is the solution of the Bellman equation (White Sofge, 1992), which is a function of the state variables but not of the action variables. Here we use function , which is closely related to , but is a function of both the state and the action variables. Function is often denoted by in the literature, following the definition in (White Sofge, 1992, Chapter 3). The critic provides the estimate of function , which is denoted as . Function , used in traditional -learning (Si et al., 2004) is the discrete-variable equivalent of .

The action network represents a control policy. Each combination of weights defines a different controller, hence by exploring the space of possible weights we approximate the dynamic programming solution for the optimal controller. ADHDP is a method for improving the controller from one iteration to the next, from time instant to . We also have internal iterations, which are not explicit (Lewis Liu, 2012; He 2011). Namely, at a given , we update the weights of the neural networks using supervised learning for a specific number of internal iteration steps.

In ADHDP, the cost function is expressed as follows; see, e.g., (Lewis Liu, 2012):

 J(x(t),u(t))=∞∑i=tαi−tr(x(i+1),u(i+1)), (2.2)

where is a discount factor for the infinite horizon problem, and is the reward or reinforcement or utility function (He, 2011; Zhang, Liu Luo Wang, 2013). We require to be a bounded semidefinite function of the state and control , so the cost function is well-defined. Using standard algebra one can derive from (2.2) that , where .

### 2.2. Action network

Next we introduce each component, starting with the action component. The action component will be represented by a neural network (NN), and its main goal is to generate control policy. For our purpose, MLP with one hidden layer is used. At each time step this component needs to provide an action based on the state vector , so is used as an input for the action network. If the hidden layer of the action MLP consists of nodes; the weight of the link between the input node and the hidden node is denoted by , for and . , where , is the weight from s hidden node to s output. The weighted sum of all inputs, i.e., the input to a hidden node is given as . The output of hidden node of the action network is denoted by .

For neural networks a variety of transfer functions are in use, see, e.g. (Zhang, Liu Luo Wang, 2013). Hyperbolic tangent is a common transfer function, which is used here: . A major advantage of the standard MLP neural network described here is the ability to approximate smooth nonlinear functions more accurately than linear basis function approximators, as the number of inputs grows (Barron, 1993; 1994). Finally, the output of the action MLP is a -dimensional vector of control variables . The diagram of the action network is shown in Fig. 2.

### 2.3. Critic network

The critic neural network, with output , learns to approximate function and it uses the output of the action network as one of its inputs. This is shown in Fig. 3. The input to the critic network is , where is output of the action network. Just as for the action NN, here we use an MLP with one hidden layer, which contains nodes. , for and is the weight from s input to s hidden node of the critic network. Here hyperbolic tangent transfer function is used. For convenience, the input to a hidden node is split in two parts with respect to inputs . The output of hidden node of the critic network is given as . Since the critic network has only one output, we have weights between hidden and output layers of the form . Finally, the output of the critic neural network can be described in the form , where denotes the inner product.

### 3.1. Adaptation of the critic network

Let be the prediction error of the critic network and be the objective function, which must be minimized. Let us consider gradient descent algorithm as the weight update rule, that is, . Here the last term is and is the learning rate.

By applying the chain rule, the adaptation of the critic network’s weights between input layer and hidden layer is given as follow , which yields

 ∂Ec(t)∂^w(1)cij(t) = ∂Ec(t)∂^J(t)∂^J(t)∂ϕci(t)∂ϕci(t)∂σci(t)∂σci(t)∂^w(1)cij(t)= (3.1) αec(t)^w(2)ci(t)[12(1−ϕ2ci(t))]yj(t).

The last calculation is obtained with respect to the main HDP (and ADHDP) paradigm, which treats at different time steps as different functions; see e.g., (Lewis Liu, 2012; Werbos, 2012). Application of the chain rule for the adaptation of the critic network’s weights between hidden layer and output layer yields , which leads to

 ∂Ec(t)∂^w(2)ci(t)=∂Ec(t)∂^J(t)∂^J(t)∂^w(2)ci(t)=αec(t)ϕci(t). (3.2)

### 3.2. Adaptation of the action network

The training of the action network can be done by using the backpropagated adaptive critic method (White Sofge, 1992), which entails adapting the weights so as to minimize . In this paper we used an importance-weighted training approach. We denote by the desired ultimate objective function. Then the minimized error measure is given in the form , where is the prediction error of the action NN.

In the framework of the reinforcement learning paradigm, the success corresponds to an objective function, which is zero at each time step (Barto, Sutton Anderson, 1983). Based on this consideration and for the sake of simplicity of the further derivations, we assume , that is, the objective function is zero at each time step, i.e. there is success.

Let us consider gradient descent algorithm as the weight update rule similarly as we did for the critic network above. That is, , where and is the learning rate.

By applying the chain rule, the adaptation of the action network’s weights between input layer and hidden layer is given as ,

 ∂Ea(t)∂^w(1)aij(t)=∂Ea(t)^J(t)[∂^J(t)∂u(t)]T∂u(t)∂ϕai(t)∂ϕai(t)∂σai(t)∂σai(t)∂^w(1)aij(t)= ∂Ea(t)^J(t)n∑k=1∂^J(t)∂uk(t)∂uk(t)∂ϕai(t)∂ϕai(t)∂σai(t)∂σai(t)∂^w(1)aij(t)= ^J(t)n∑k=1Nhc∑r=1[^w(2)cr(t)12(1−ϕ2cr(t))^w(1)cr,m+k(t)]×^w(2)aki(t)12(1−ϕ2ai(t))xj(t), (3.3)

where

 ∂^J(t)∂uk(t)=Nhc∑i=1∂^J(t)∂ϕci(t)∂ϕci(t)∂σci(t)∂σci(t)∂uk(t). (3.4)

Using similar approach for the action network’s weights between hidden layer and output layer, finally we get the following ,

 ∂Ea(t)∂^w(2)akj(t)=∂Ea(t)^J(t)∂^J(t)∂uk(t)∂uk(t)∂^w(2)akj(t)= ea(t)Nhc∑r=1[^w(2)cr(t)12(1−ϕ2cr(t))^w(1)cr,m+k(t)]ϕaj(t). (3.5)

## 4. Lyapunov stability analysis of ADHDP

In this section we employ Lyapunov function approach to evaluate the stability of dynamical systems. The applied Lyapunov analysis allows to establish the UUB property without deriving the explicit solution of the state equations.

### 4.1. Basics of the Lyapunov approach

Let denote the optimal weights, that is, the following holds: ; we assume that the desired ultimate objective corresponds to success then .

Consider the weight estimation error over full design, that is, over both critic and action networks of the following form: . Then equations (3.1), (3.2), (3.3) and (3.5) define a dynamical system of estimation errors for some nonlinear function in the following form

 ~w(t+1)=~w(t)−F(^w(t−1),^w(t),ϕ(t−1),ϕ(t)). (4.1)
###### Definition 1.

A dynamical system is said to be uniformly ultimately bounded with ultimate bound , if for any and , there exists a positive number independent of , such that for all whenever .

In the present study, we make use of a theorem concerning the UUB property (Sarangapani, 2006). Detailed proof of this theorem appears in (Michel  Hou  Liu, 2008). We adapt the notation for our situation and address the special case of a discrete dynamical systems as given in (4.1).

###### Theorem 1.

(UUB property of a discrete dynamical system) If, for system (4.1), there exists a function such that for all in a compact set , is positive definite and the first difference, for , for some , such that -neighborhood of is contained in , then the system is UUB and the norm of the state is bounded to within a neighborhood of .

Based on this theorem, which gives a sufficient condition, we can determine the UUB property of the dynamical system selecting an appropriate function . For this reason, we first consider all components of our function candidate separately and investigate their properties, and thereafter we study the behavior of function to match the condition from Theorem 1.

### 4.2. Preliminaries

In this subsection we introduce four lemmas which will be used in the proof of our main theorem.

###### Assumption 1.

Let and be the optimal weights for action and critic networks. Assume they are bounded, i.e., and .

###### Lemma 1.

Under Assumption 1, the first difference of is expressed by

 (4.2)

where is the approximation error of the output of the critic network.

###### Proof.

(Lemma 1). Using (3.2) and taking into account that does not depend on , i.e., it is optimal for each time moment , we get the following

 ~w(2)c(t+1)=^w(2)c(t+1)−w(2)c∗=~w(2)c(t)−αlcϕc[α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)]T. (4.3)

Based on the last expression, we can find the trace of multiplication of by itself in the following way:

 tr[(~w(2)c(t+1))T~w(2)c(t+1)]=(~w(2)c(t))T~w(2)c(t)− 2αlc~w(2)c(t)ϕc(t)[α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)]T+ α2l2c∥ϕc(t)∥2∥∥α^w(2)cϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)∥∥2. (4.4)

Since is a scalar, we can rewrite the middle term in the above formula as follows:

 −2αlc~w(2)c(t)ϕc(t)[α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)]= lc(∥∥α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)−α~w(2)c(t)ϕc(t)∥∥2− ∥∥α~w(2)c(t)ϕc(t)∥∥2−∥∥α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)∥∥2)= lc(∥∥αw∗(2)cϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)∥∥2−α2∥ζc(t)∥2− ∥∥α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)∥∥2). (4.5)

Here the definition of is applied to obtain the above expression.

Now let us consider the first difference of in the form

 ΔL1(t)=1lc[(~w(2)c(t+1))T~w(2)c(t+1)−(~w(2)c(t))T~w(2)c(t)]. (4.6)

Substituting the results for , finally we get the statement of the lemma, as required. ∎

###### Lemma 2.

Under Assumption 1, the first difference of is bounded by

 4∥ζc(t)∥2+4∥∥w∗(2)cϕc(t)∥∥2+∥∥^w(2)c(t)C(t)ζa(t)∥∥2), (4.7)

where is the approximation error of the action network output and is a weighting factor; is the matrix with coefficients , where , and .

###### Proof.

(Lemma 2). Let us consider the weights from the hidden layer to output layer of the action network which are updated according to (3.5)

 ~w(2)a(t+1)=^w(2)a(t+1)−w∗(2)a=^w(2)a(t)−laϕa(t)^w(2)c(t)C(t)[^w(2)c(t)ϕc(t)]T−w∗(2)a=~w(2)a(t)−laϕa(t)^w(2)c(t)C(t)[^w(2)c(t)ϕc(t)]T. (4.8)

Based on this expression, it is easy to see that

 tr[(~w(2)a(t+1))T~w(2)a(t+1)]=(~w(2)a(t))T~w(2)a(t)+l2a∥ϕa(t)∥2∥∥^w(2)c(t)C(t)∥∥2∥∥^w(2)c(t)ϕc(t)∥∥2−2la^w(2)c(t)C(t)[^w(2)c(t)ϕc(t)]Tζa(t). (4.9)

Here the last formula is based on the assumption that all vector multiplications are under trace function.

Now let us consider the first difference of function , that is, the following expression

 ΔL2(t)=1laγ1tr[(~w(2)a(t+1))T~w(2)a(t+1)−(~w(2)a(t))T~w(2)a(t)]. (4.10)

After substituting the appropriate terms in the last formula, we get

 ΔL2(t) = 1γ1(la∥ϕa(t)∥2∥∥^w(2)c(t)C(t)∥∥2∥∥^w(2)c(t)ϕc(t)∥∥2 (4.11) − 2^w(2)c(t)C(t)[^w(2)c(t)ϕc(t)]Tζa(t)).

Consider the last term of (4.11)

 −2^w(2)c(t)C(t)[^w(2)c(t)ϕc(t)]Tζa(t)=∥∥^w(2)c(t)ϕc(t)−^w(2)c(t)C(t)ζa(t)∥∥2− ∥∥^w(2)c(t)C(t)ζa(t)∥∥2−∥∥^w(2)c(t)ϕc(t)∥∥2. (4.12)

After substituting this formula into , we get

 ΔL2(t)=1γ1(la∥ϕa(t)∥2∥∥^w(2)c(t)C(t)∥∥2∥∥^w(2)c(t)ϕc(t)∥∥2+ (4.13)

Notice that

 ∥∥^w(2)c(t)ϕc(t)−^w(2)c(t)C(t)ζa(t)∥∥2−∥∥^w(2)c(t)C(t)ζa(t)∥∥2≤ 2∥∥^w(2)c(t)ϕc(t)∥∥2+∥∥^w(2)c(t)C(t)ζa(t)∥∥2≤ 2∥∥(~w(2)c(t)+w∗(2)c)ϕc(t)∥∥2+∥∥^w(2)c(t)C(t)ζa(t)∥∥2≤ 2(∥∥~w(2)c(t)ϕc(t)∥∥+∥∥w∗(2)cϕc(t)∥∥)2+∥∥^w(2)c(t)C(t)ζa(t)∥∥2≤ 4∥ζc(t)∥2+4∥∥w∗(2)cϕc(t)∥∥2+∥∥^w(2)c(t)C(t)ζa(t)∥∥2. (4.14)

Finally we get the following bound for , as required:

 4∥ζc(t)∥2+4∥∥w∗(2)cϕc(t)∥∥2+∥∥^w(2)c(t)C(t)ζa(t)∥∥2). (4.15)

###### Remark 1.

If we introduce the following normalization for the network’s weights and fix the weights of the input layer, then applying Lemmas 1 and 2, we can readily obtain the results given by (Liu et al., 2012).

###### Lemma 3.

Under Assumption 1, the first difference of is bounded by

 ΔL3(t)≤1γ2(α2lc∥∥α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)∥∥2∥a(t)∥2∥y(t)∥2+ α∥∥~w(1)c(t)y(t)aT(t)∥∥2+α∥∥α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)∥∥2), (4.16)

where is a weighting factor and is a vector, with for .

###### Proof.

(Lemma 3). Let us consider the weight update rule of the critic network between input layer and hidden layer in the form

 ^w(1)c(t+1)=^w(1)c(t)−αlc(α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1))TB(t), (4.17)

where , for .

Following the same approach as earlier, we can express by

 ~w(1)c(t+1)=^w(1)c(t+1)−w∗(1)c=~w(1)c(t)− αlc(α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1))TB(t). (4.18)

For convenience, we introduce the following notation . Then the trace of multiplication can be written as

 tr[(~w(1)c(t+1))T~w(1)c(t+1)]=(~w(1)c(t))T~w(1)c(t)+ α2l2c∥∥α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1)∥∥2BT(t)B(t)− 2αlc(α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1))BT(t)~w(1)c(t). (4.19)

Using the property of trace function, that is, the following , we can express the last term of (4.19) as follows:

 −2αlc(α^w(2)c(t)ϕc(t)+r(t)−^w(2)c(t−1)ϕc(t−1))y(t)aT(t)