# Leader-based Optimal Coordination Control for the Consensus Problem of Multi-agent Differential Games via Fuzzy Adaptive Dynamic Programming

Huaguang Zhang, Jilie Zhang, Guang-Hong Yang, and Yanhong Luo

Huaguang Zhang, Jilie Zhang, Guang-Hong Yang, and Yanhong Luo are with the College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110819, P. R. China (hgzhang@ieee.org; jilie0226@163.com; yangguanghong@ise.neu.edu.cn; neuluo@163.com). Huaguang Zhang and Guang-Hong Yang are also with the State Key Laboratory of Synthetical Automation for Process Industries (Northeastern University), Shenyang, Liaoning, 110004, China.

###### Abstract

In this paper, a new online scheme is presented to design the optimal coordination control for the consensus problem of multi-agent differential games by fuzzy adaptive dynamic programming (FADP), which brings together game theory, the generalized fuzzy hyperbolic model (GFHM), and adaptive dynamic programming (ADP). In general, the optimal coordination control for multi-agent differential games is the solution of the coupled Hamilton-Jacobi (HJ) equations. Here, for the first time, GFHMs are used to approximate the solutions (value functions) of the coupled HJ equations, based on the policy iteration (PI) algorithm. Namely, for each agent, a GFHM is used to capture the mapping between the local consensus error and the local value function. Since our scheme uses a single-network architecture for each agent (which eliminates the action network model compared with the dual-network architecture), it is a more reasonable architecture for multi-agent systems. Furthermore, the approximate solution is utilized to obtain the optimal coordination control. Finally, we give the stability analysis for our scheme and prove that the weight estimation error and the local consensus error are uniformly ultimately bounded (UUB). Further, the control node trajectory is proven to be cooperative uniformly ultimately bounded (CUUB).

Index Terms: Optimal coordination control, consensus problem, multi-agent differential game, fuzzy adaptive dynamic programming, generalized fuzzy hyperbolic model.

## I Introduction

In recent decades, consensus problems of multi-agent systems (for instance, formation control [1], flocking [2, 3], rendezvous [4], and sensor networks [5, 6]) have received considerable attention; see, e.g., [7, 8] and [9]. Consensus problems originated in computer science and formed the foundation of the field of distributed computing [10]. Subsequently, they were extended to management science and statistics [11]. References [12] and [13], published in the 1980s, are regarded as the pioneering work on consensus problems in control theory. In [14], Olfati-Saber and Murray presented the fundamental framework for solving consensus problems for multi-agent systems. The overviews [15] and [16] summarize recent achievements in coordination control for consensus problems of multi-agent systems.

In [15], Ren et al. posed an open research problem for consensus problems: how to design an optimal coordination control that not only stabilizes the multi-agent system but also minimizes its performance indexes. In a physical sense, the optimal coordination control makes every agent consume the least amount of energy while the agents reach a consensus. In fact, every agent depends on its own actions and on those of all its neighboring agents. Therefore, every agent needs to choose a control that minimizes its own performance index by acting on itself, according to the outcomes of its neighboring agents. This is similar to a multi-player cooperative game.

Game theory [17] studies strategic decision-making problems. More formally, it is “the study of mathematical models of conflict and cooperation between intelligent rational decision-makers.” In cooperative games, communication among players is allowed, and the decision of each player depends on his own actions and on those of all the other players. Early on, game theory was widely used to solve multi-player game problems, such as in [18] and [19]. Recently, game theory has also become the theoretical basis for multi-agent games [20, 21, 22]. In a differential game, the evolution of the agents’ state variables is governed by differential equations, and the problem of finding an optimal strategy is closely related to optimal control theory. In particular, closed-loop strategies can be found by Bellman’s dynamic programming method, as in [18, 19, 20]. For multi-agent systems, since every agent’s action depends on the outcomes of itself and all of its neighboring agents, a set of coupled Hamilton-Jacobi (HJ) equations arises. Therefore, for multi-agent differential games, the optimal coordination control relies on solving the coupled HJ equations, which is in general very difficult.

Therefore, in this paper, the ADP algorithm ([23] and [24]), which combines adaptive control and reinforcement learning, is introduced to learn the solution of the HJ equations online for multi-agent systems. Excellent overviews of the state-of-the-art developments of the ADP algorithm are given in [25, 26, 27, 28]. How to approximate the value function is a key problem in the ADP algorithm. By the Weierstrass higher-order approximation theorem [29], a complete basis can be used to approximate the solution of the Hamilton-Jacobi-Bellman (HJB) equation by a linear expression as the number of basis functions tends to infinity. For a finite number of basis functions, however, the approximation is sensitive to the chosen basis. If a smooth function cannot be spanned by a finite independent basis set, then that basis set cannot approximate the function exactly. Therefore, we want to choose a set of independent basis functions that captures the significant features of the value function as well as possible. Traditionally, neural networks are used as the approximator. However, neural networks do not have a clear physical significance, and their activation functions (basis functions) are chosen manually, so we do not know whether the selected activation functions are appropriate. This motivates us to circumvent the disadvantage by using fuzzy approximation technology (a fuzzy approximator), which can characterize the value function more reasonably by using knowledge from human experts and experiments. The generalized fuzzy hyperbolic model (GFHM) is a good choice of function approximator [30, 31, 32]: it has a clear physical significance [35] (it is easy to construct a GFHM if we know some linguistic information about the relationship between the function output and the input variables), and the model weights can be optimized by adaptive learning.

In particular, the GFHM transforms the problem of how to choose basis functions in a neural network model into the problem of how to translate the input variables. In this way, the entire input space can be covered as much as possible by choosing sufficient and proper generalized input variables. The GFHM is therefore a suitable approximator for estimating the value function, as in [33] and [34].

In recent years, some optimal control methods have been proposed for the multi-agent consensus problem, such as the linear quadratic regulator (LQR) technique [36] and the model predictive control (MPC) technique [37]. However, the method in [36] is limited to linear systems and is an off-line design procedure. The method in [37] obtains a good on-line controller for single- and double-integrator multi-agent systems (in particular, with a time-varying communication network), but it requires continuous sampling and real-time prediction, and it yields a control sequence over a finite horizon only. Moreover, [37] addresses leaderless discrete-time agents. Here, we deal with the continuous-time nonlinear consensus problem with a leader, online, by using the ADP algorithm. The algorithm solves the coupled HJ equations directly by policy iteration and adaptive control methods, and it avoids the sampling and repeated prediction required in [37]. In addition, once the ADP algorithm no longer changes the adjustable weights of the control, we obtain an optimal control law in functional form for the infinite horizon.

In this paper, our main idea is to use game theory to solve the optimal coordination control problem for multi-agent systems based on adaptive dynamic programming. By Bellman’s dynamic programming method, we construct the coupled HJ equations for multi-agent differential games. To obtain the solution of the coupled HJ equations, GFHMs are used to approximate the value functions (the solution) under the framework of the PI algorithm [38]. This approximation introduces residual errors in the coupled HJ equations. To minimize these errors, gradient descent is used to update the weights of the GFHM approximators, and the update is carried out continuously until the weights no longer change. We call the resulting scheme fuzzy adaptive dynamic programming (FADP). Finally, we analyze the stability conditions and prove that the weight error and the local consensus error are uniformly ultimately bounded (UUB).

The contributions of the paper include:

1. The cooperative problem of multi-player games is extended to the coordination consensus control problem of nonlinear multi-agent systems. The paper establishes a relationship between the optimal consensus problem for multi-agent systems and the Nash equilibrium of cooperative game theory.

2. The coupled Hamilton-Jacobi equations for multi-agent systems are established by Bellman’s dynamic programming, and then the stability analysis is developed for our scheme.

3. The open problem, i.e., the optimal consensus problem for multi-agent systems presented in [15], is solved by fuzzy adaptive dynamic programming with single-network architecture for the first time. Namely, only one GFHM is used to approximate the local value function for each agent.

4. The proposed single-network architecture eliminates the action network model and reduces the number of updated weights, compared with the dual-network architecture used in [23] and [39].

The rest of this paper is organized as follows. In Section II, some definitions and notation are given. The local consensus error dynamic system is established in Section III. In Section IV, the coupled Hamilton-Jacobi equations for multi-agent systems are derived, the stability of the Nash equilibrium is proven, and the coupled HJ equations are solved by the PI algorithm. Section V derives the approximate coupled HJ equations by using GFHMs. Section VI gives the stability analysis for our scheme and proves that the weight estimation error and the local consensus error are UUB and that the control node trajectory is CUUB. Finally, a numerical example is given to illustrate the effectiveness of our scheme.

## II Preliminaries

The purpose of this section is to provide the foundations of graph theory, information consensus and generalized fuzzy hyperbolic model.

### II-A Graph Theory

In this paper, graph theory is used as a mathematical tool to analyze multi-agent systems. Whether the information flow is unidirectional or bidirectional, the topology of a communication network can be expressed by a weighted graph.

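Let $G=(V,E,A)$ be a weighted graph of $N$ nodes with the nonempty finite set of nodes $V=\{v_1,\ldots,v_N\}$, where the set of edges $E$ belongs to the product space of $V$ (i.e., $E\subseteq V\times V$). An edge of $G$ is denoted by $(v_j,v_i)$, which is a direct path from node $j$ to node $i$, and $A=[a_{ij}]$ is a weighted adjacency matrix with nonnegative adjacency elements, i.e., $a_{ij}\ge 0$, with $a_{ij}>0$ if $(v_j,v_i)\in E$ and $a_{ij}=0$ otherwise. The node index $i$ belongs to a finite index set $I=\{1,\ldots,N\}$.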

###### Definition 1 (Laplacian Matrix)

The graph Laplacian matrix $L=[l_{ij}]$ is defined as $L=D-A$, with $D=\mathrm{diag}(d_1,\ldots,d_N)$ being the in-degree matrix of the graph, where $d_i=\sum_{j=1}^{N}a_{ij}$ is the in-degree of node $i$ in the graph.

###### Remark 1

The Laplacian matrix $L$ has all row sums equal to zero.
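
The row-sum property in Remark 1 can be checked numerically. The sketch below assumes the common convention $L = D - A$ with $D$ the diagonal matrix of in-degrees (row sums of $A$); the function name `laplacian` and the 3-node adjacency matrix are illustrative assumptions, not taken from the paper.

```python
def laplacian(adjacency):
    """Return L = D - A, where D holds the in-degrees (row sums of A)."""
    n = len(adjacency)
    lap = []
    for i in range(n):
        in_degree = sum(adjacency[i])
        row = [-adjacency[i][j] for j in range(n)]
        row[i] += in_degree  # diagonal entry: d_i - a_ii
        lap.append(row)
    return lap


# Hypothetical 3-node digraph: A[i][j] > 0 means node i receives from node j.
A = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 2.0],
     [1.0, 0.0, 0.0]]
L = laplacian(A)
row_sums = [sum(row) for row in L]  # each entry should be zero
```

Because every diagonal entry is built from the same row of $A$ that is subtracted off the diagonal, each row cancels regardless of the edge weights.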

In this paper, we assume the graph is simple, i.e., it has no repeated edges and no self loops. The set of neighbors of node $i$ is denoted by $N_i=\{j : a_{ij}>0\}$. A graph is said to have a spanning tree if there is a node (called the root) such that there is a directed path from the root to every other node in the graph. A digraph is said to be strongly connected if there is a directed path from node $i$ to node $j$ for all distinct nodes $v_i, v_j\in V$. A digraph has a spanning tree if it is strongly connected, but not vice versa.

Here, we focus on the strongly connected communication digraph with fixed topology.
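
The difference between having a spanning tree and being strongly connected can be illustrated with a small reachability check. This is a minimal sketch under the assumption that `adjacency[i][j] > 0` encodes an edge from node `j` to node `i`; the helper names and the two example graphs are hypothetical.

```python
from collections import deque


def reachable(adjacency, start, reverse=False):
    """Nodes reachable from `start` along (or against) the edge directions."""
    n = len(adjacency)
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in range(n):
            # adjacency[i][j] > 0 encodes an edge from node j to node i.
            weight = adjacency[u][v] if reverse else adjacency[v][u]
            if weight > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def has_spanning_tree(adjacency):
    """True if some root reaches every node along directed paths."""
    n = len(adjacency)
    return any(len(reachable(adjacency, root)) == n for root in range(n))


def is_strongly_connected(adjacency):
    """True if node 0 reaches all nodes and all nodes reach node 0."""
    n = len(adjacency)
    return (len(reachable(adjacency, 0)) == n
            and len(reachable(adjacency, 0, reverse=True)) == n)


cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]  # directed 3-cycle
chain = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]  # directed path 0 -> 1 -> 2
```

The directed cycle is strongly connected and therefore has a spanning tree, while the directed chain has a spanning tree rooted at node 0 but is not strongly connected.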

### II-B Consensus for Networks of Agents

A multi-agent system is a network that consists of a group of agents. Every agent is called a node of the network. Let $x_i\in\mathbb{R}^n$ denote the state of node $i$. We call $G_x=(G,x)$ (with the state $x$) a network (or algebraic graph), where $x=[x_1^T,\ldots,x_N^T]^T$. The state of a node might represent a physical quantity of the agent, such as altitude, velocity, angle, voltage, and so on. We say the nodes of a network have reached a consensus if and only if $x_i=x_j$ for all $i,j\in I$. For the consensus problem with a leader, every node requires $x_i\to x_0$, $\forall i\in I$, where $x_0$ is the state trajectory of the leader.

### II-C Generalized Fuzzy Hyperbolic Model

###### Definition 2

Given a plant with $n$ input variables $x=[x_1,\ldots,x_n]^T$ and an output variable $y$. We call the fuzzy rule base the generalized fuzzy hyperbolic rule base if it satisfies the following conditions:

1. The $l$th fuzzy rule takes the following form:

 $R^l$: IF $(x_1-d_{11})$ is $F_{x_{11}}$, …, $(x_1-d_{1w_1})$ is $F_{x_{1w_1}}$, $(x_2-d_{21})$ is $F_{x_{21}}$, …, $(x_2-d_{2w_2})$ is $F_{x_{2w_2}}$, …, $(x_n-d_{n1})$ is $F_{x_{n1}}$, …, and $(x_n-d_{nw_n})$ is $F_{x_{nw_n}}$, THEN $y^l=c_{F_{11}}+\ldots+c_{F_{1w_1}}+\ldots+c_{F_{n1}}+\ldots+c_{F_{nw_n}}$,

where $w_z$ ($z=1,\ldots,n$) represents the number of transformations associated with each $x_z$, $d_{zj}$ ($j=1,\ldots,w_z$) are constants that define the transformations, $F_{x_{zj}}$ are fuzzy sets of $x_z-d_{zj}$, which include the subsets $P_z$ (positive) and $N_z$ (negative), and $c_{F_{zj}}$ are constants corresponding to $F_{x_{zj}}$.

2. The constants $c_{F_{zj}}$ in the THEN-part correspond to $F_{x_{zj}}$ in the IF-part. That is, if there is $F_{x_{zj}}$ in the IF-part, $c_{F_{zj}}$ must appear in the THEN-part; otherwise, $c_{F_{zj}}$ does not appear in the THEN-part.

3. There are $s=2^m$ fuzzy rules in the rule base, where $m=\sum_{z=1}^{n}w_z$; that is, the rule base contains all the possible $P_z$ and $N_z$ combinations of input variables in the IF-part and all the linear combinations of constants in the THEN-part.

###### Lemma 1

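[30, 32] For a multiple-input single-output system, define the generalized input variables as

$$
\bar{x}_i = x_z - d_{zj}, \quad i=1,\ldots,m,
$$

and the generalized fuzzy hyperbolic rule base as in Definition 2, where the membership functions of the generalized input variables $P_z$ and $N_z$ are defined as

$$
\mu_{P_z}(x_z)=e^{-\frac{1}{2}(x_z-\phi_z)^2},\qquad \mu_{N_z}(x_z)=e^{-\frac{1}{2}(x_z+\phi_z)^2},
$$

where $\phi_z>0$.

We can then derive the following model:

$$
y=\theta^T\tanh(\Phi\bar{x})+\zeta,
$$

where $\theta=[\theta_1,\ldots,\theta_m]^T$ is an ideal vector; $\tanh(\Phi\bar{x})=[\tanh(\phi_1\bar{x}_1),\ldots,\tanh(\phi_m\bar{x}_m)]^T$ with $\Phi=\mathrm{diag}(\phi_i)$ ($i=1,\ldots,m$) and $\bar{x}=[\bar{x}_1,\ldots,\bar{x}_m]^T$; and $\zeta$ is a constant scalar. We call it the generalized fuzzy hyperbolic model (GFHM).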

###### Lemma 2

[30] Let $Y$ be the set of all generalized fuzzy hyperbolic models given by Lemma 1. For any given real continuous function $f(x)$ on the compact set $U\subset\mathbb{R}^n$ and any arbitrary $\delta>0$, there exists an $h(x)\in Y$ such that

$$
\sup_{x\in U}|f(x)-h(x)|<\delta.
$$

###### Remark 2

Lemma 2 shows that the GFHM can uniformly approximate any continuous nonlinear function over $U$ to any degree of accuracy if $U$ is compact; that is, the GFHM is a universal approximator (see [30] for details). Therefore, the GFHM can approximate the function within an error bound, provided that sufficient and proper generalized input variables are used to cover the entire space as much as possible. Here, the translations of the input variables need to be chosen by expertise or by manual selection.
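
To make the model of Lemma 1 concrete, the following sketch evaluates a GFHM output of the form $y=\theta^T\tanh(\Phi\bar{x})+\zeta$, where the generalized inputs are shifted copies of the original inputs. The function name `gfhm`, its argument layout, and the numeric values are illustrative assumptions; in particular, $\Phi$ is taken to be diagonal and passed as a list of its diagonal entries.

```python
import math


def gfhm(x, shifts, phi, theta, zeta):
    """Evaluate y = theta^T tanh(Phi x_bar) + zeta.

    x      : list of the n original inputs x_z
    shifts : shifts[z] lists the translations d_zj applied to x_z
    phi    : diagonal entries of Phi, one per generalized input
    theta  : weight vector, one entry per generalized input
    zeta   : constant scalar offset
    """
    # Generalized inputs: x_bar_i = x_z - d_zj, stacked over z and j.
    x_bar = [x[z] - d for z in range(len(x)) for d in shifts[z]]
    if not (len(x_bar) == len(phi) == len(theta)):
        raise ValueError("phi and theta need one entry per generalized input")
    return sum(t * math.tanh(p * xb)
               for t, p, xb in zip(theta, phi, x_bar)) + zeta


# Hypothetical one-input model with two translations, d_11 = 0.5 and d_12 = -0.5.
y = gfhm(x=[0.5], shifts=[[0.5, -0.5]], phi=[1.0, 1.0],
         theta=[2.0, 3.0], zeta=0.1)
```

Adding more translations per input enlarges the set of basis functions $\tanh(\phi_i\bar{x}_i)$ without changing the model structure, which is the sense in which basis selection becomes a choice of input translations.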

## III Consensus Error Dynamic System

Consider multi-agent systems with $N$ agents in the form of a communication network. Their node dynamics are

$$
\dot{x}_i=f(x_i)+g_i(x_i)u_i, \tag{1}
$$

where is the state of node , is the input coordination control. and , such that and contains the origin (, is the Euclidean norm).

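The global network dynamics are

$$
\dot{x}=f(x)+g(x)u, \tag{2}
$$

where the global state vector of the multi-agent system (2) is $x=[x_1^T,\ldots,x_N^T]^T$, the global node dynamics vector is $f(x)=[f^T(x_1),\ldots,f^T(x_N)]^T$, $g(x)=\mathrm{diag}(g_i(x_i))$ with $i=1,\ldots,N$, and the global control input is $u=[u_1^T,\ldots,u_N^T]^T$. $N$ is the number of nodes.

The state of the control node (or leader) is $x_0(t)$, which satisfies the dynamics

$$
\dot{x}_0=f(x_0), \tag{3}
$$

where $x_0\in\mathbb{R}^n$ and $f(x_0)$ is a differentiable function.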

The local neighborhood consensus error for node $i$ is defined as

$$
e_i=\sum_{j\in N_i}a_{ij}(x_i-x_j)+b_i(x_i-x_0), \tag{4}
$$

where $e_i\in\mathbb{R}^n$ ($i\in I$) and $b_i$ is the pinning gain ($b_i\ge 0$). Note that $b_i>0$ for at least one $i$. Then $b_i=0$ if and only if there is no direct path from the control node to node $i$ in $G$; otherwise $b_i>0$. The nodes $i$ with $b_i>0$ are referred to as the pinned or controlled nodes.

###### Remark 3

The local neighborhood consensus error $e_i$ indicates whether node $i$ agrees with the leader and its neighbors; that is, the multi-agent system reaches a consensus as $e_i\to 0$ for all $i\in I$.
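
The error (4) is straightforward to compute from the adjacency weights and pinning gains. The following is a minimal sketch; the function name `local_error` and the two-node example (node 0 pinned to the leader, node 1 listening to node 0) are illustrative assumptions.

```python
def local_error(i, states, leader, adjacency, pinning):
    """e_i = sum_j a_ij (x_i - x_j) + b_i (x_i - x_0), as in (4)."""
    n = len(leader)
    e = [0.0] * n
    for j, a_ij in enumerate(adjacency[i]):
        if a_ij > 0:  # node j is a neighbor of node i
            for k in range(n):
                e[k] += a_ij * (states[i][k] - states[j][k])
    for k in range(n):
        e[k] += pinning[i] * (states[i][k] - leader[k])
    return e


# Hypothetical network: node 0 is pinned to the leader, node 1 listens to node 0.
adjacency = [[0.0, 0.0],
             [1.0, 0.0]]
pinning = [1.0, 0.0]
leader = [1.0, 2.0]

at_consensus = [[1.0, 2.0], [1.0, 2.0]]
off_consensus = [[2.0, 2.0], [1.0, 2.0]]
e0 = local_error(0, off_consensus, leader, adjacency, pinning)
e1 = local_error(1, off_consensus, leader, adjacency, pinning)
```

When every state equals the leader state, all local errors vanish; moving node 0 away from the leader produces a nonzero error at node 0 and, through the edge weight, at node 1.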

The global error vector for the network is

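$$
\begin{aligned}
e &= (L\otimes I_n)x+(B\otimes I_n)(x-\underline{x}_0) \\
&= (L\otimes I_n)x-(L\otimes I_n)\underline{x}_0+(B\otimes I_n)(x-\underline{x}_0) \\
&= ((L+B)\otimes I_n)(x-\underline{x}_0) \\
&= \mathcal{L}(x-\underline{x}_0),
\end{aligned}
\tag{5}
$$

with $\mathcal{L}=(L+B)\otimes I_n$ ($I_n$ is an identity matrix with $n$ dimensions), where $L$ is the Laplacian matrix of the network; $e=[e_1^T,\ldots,e_N^T]^T$ and $\underline{x}_0=\underline{1}\otimes x_0$, where $\underline{1}$ is the $N$-vector of ones; $B$ is a diagonal matrix with diagonal entries $b_i$ (i.e., $B=\mathrm{diag}(b_1,\ldots,b_N)$); and $\otimes$ is the Kronecker product operator. The second equality uses $(L\otimes I_n)\underline{x}_0=0$, which holds because the row sums of $L$ are zero. Differentiating (4) or (5), the dynamics of the local neighborhood consensus error for the network are given by

$$
\begin{aligned}
\dot{e}_i &= ((L_i+B_i)\otimes I_n)(\dot{x}-\dot{\underline{x}}_0) \\
&= ((L_i+B_i)\otimes I_n)(f(x)+g(x)u-\underline{f}(x_0)) \\
&= ((L_i+B_i)\otimes I_n)(f_e(t)+g(x)u) \\
&= \sum_{j\in I}((l_{ij}+b_{ij})\otimes I_n)(f_{e_j}(t)+g_j(x_j)u_j) \\
&= ((l_{ii}+b_{ii})\otimes I_n)(f_{e_i}(t)+g_i(x_i)u_i)+\sum_{j\in N_i}((l_{ij}+b_{ij})\otimes I_n)(f_{e_j}(t)+g_j(x_j)u_j),
\end{aligned}
\tag{6}
$$

where $f_e(t)=f(x)-\underline{f}(x_0)$ with $f_{e_j}(t)=f(x_j)-f(x_0)$, and $\underline{f}(x_0)=\underline{1}\otimes f(x_0)$. $L_i$ denotes the $i$th row vector of the Laplacian matrix $L$, that is, $L_i=[l_{i1},\ldots,l_{iN}]$. Similarly, $B_i=[b_{i1},\ldots,b_{iN}]$.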

###### Remark 4

Since $l_{ij}+b_{ij}$ is zero when node $j$ is not a neighbor of node $i$, the expressions (6) only contain the control inputs of node $i$ and of all its neighbors in the network. In other words, the local neighborhood consensus error $e_i$ depends on the states and the control inputs of node $i$ and all of its neighbors.
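
The stacked form of the error can be cross-checked against the node-wise definition. The sketch below assumes the compact form $e=((L+B)\otimes I_n)(x-\underline{x}_0)$ with $L=D-A$ and $B=\mathrm{diag}(b_i)$; the helper names and the two-node example are hypothetical, and the Kronecker product is built densely for readability rather than efficiency.

```python
def kron_identity(matrix, n):
    """Dense Kronecker product of a square matrix with I_n, as nested lists."""
    size = len(matrix)
    out = [[0.0] * (size * n) for _ in range(size * n)]
    for i in range(size):
        for j in range(size):
            for k in range(n):
                out[i * n + k][j * n + k] = matrix[i][j]
    return out


def global_error(adjacency, pinning, states, leader):
    """Stacked error e = ((L + B) kron I_n)(x - 1 kron x_0)."""
    size, n = len(adjacency), len(leader)
    # M = L + B with L = D - A (in-degree Laplacian) and B = diag(pinning).
    m = [[-adjacency[i][j] for j in range(size)] for i in range(size)]
    for i in range(size):
        m[i][i] += sum(adjacency[i]) + pinning[i]
    big = kron_identity(m, n)
    delta = [states[i][k] - leader[k] for i in range(size) for k in range(n)]
    return [sum(row[c] * delta[c] for c in range(size * n)) for row in big]


# Hypothetical network: node 0 is pinned to the leader, node 1 listens to node 0.
adjacency = [[0.0, 0.0],
             [1.0, 0.0]]
pinning = [1.0, 0.0]
leader = [1.0, 2.0]
states = [[2.0, 2.0], [1.0, 2.0]]
e = global_error(adjacency, pinning, states, leader)
```

For this example the stacked result agrees with evaluating $\sum_{j\in N_i}a_{ij}(x_i-x_j)+b_i(x_i-x_0)$ node by node, which gives $[1, 0]$ for node 0 and $[-1, 0]$ for node 1.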

###### Definition 3

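(Uniformly Ultimately Bounded (UUB)) The local neighborhood consensus error $e_i(t)$ is uniformly ultimately bounded (UUB) if there exists a compact set $\Omega\subset\mathbb{R}^n$ so that, for all $e_i(t_0)\in\Omega$, there exist a bound $B$ and a time $t_f(B,e_i(t_0))$, both independent of $t_0\ge 0$, such that $\|e_i(t)\|\le B$ for all $t\ge t_0+t_f$.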

###### Definition 4

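(Cooperative Uniformly Ultimately Bounded (CUUB)) [40] The control node trajectory $x_0(t)$ given by (3) is cooperative uniformly ultimately bounded (CUUB) with respect to solutions of the node dynamics (1) if there exists a compact set $\Omega\subset\mathbb{R}^n$ so that, for all $(x_i(t_0)-x_0(t_0))\in\Omega$, there exist a bound $B$ and a time $t_f(B,(x_i(t_0)-x_0(t_0)))$, both independent of $t_0\ge 0$, such that $\|x_i(t)-x_0(t)\|\le B$ for all $i\in I$ and all $t\ge t_0+t_f$.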

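## IV Optimal Coordination Control

To reach a consensus while simultaneously minimizing the local performance index of every agent, we use the machinery of $N$-person cooperative games ([19, 20]) to design the optimal coordination control for the systems (6).

### IV-A The Coupled HJ Equation

Define the local performance indexes (cost functionals) by

$$
\begin{aligned}
J_i(e_i(0),u_i,u_{(j)}) &= \int_0^\infty r_i(e_i,u_i,u_{(j)})\,dt \\
&= \int_0^\infty \Big(e_i^TQ_{ii}e_i+\sum_{j\in I}u_j^TR_{ij}u_j\Big)dt \\
&= \int_0^\infty \Big(e_i^TQ_{ii}e_i+u_i^TR_{ii}u_i+\sum_{j\in I,\,j\neq i}u_j^TR_{ij}u_j\Big)dt \\
&= \int_0^\infty \Big(e_i^TQ_{ii}e_i+u_i^TR_{ii}u_i+\sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt,
\end{aligned}
\tag{7}
$$

with $r_i(e_i,u_i,u_{(j)})=e_i^TQ_{ii}e_i+u_i^TR_{ii}u_i+\sum_{j\in N_i}u_j^TR_{ij}u_j$. Here $u_{(j)}=\{u_j : j\in N_i\}$ are the control input vectors of the neighbors of node $i$.

All weighting matrices are constant and satisfy $Q_{ii}>0$, $R_{ii}>0$, and $R_{ij}\ge 0$. Note that if $u_j$ is the control input of a neighbor of node $i$, then $R_{ij}\neq 0$, and vice versa. Otherwise, $R_{ij}=0$. In other words, the performance index $J_i$ depends on the input information of node $i$ and its neighbors.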

###### Problem 1

The problem to be solved is how to design the local optimal coordination controls $u_i^*$ ($i\in I$) that minimize the local performance indexes (7) subject to (6) and make all nodes (agents) reach a consensus on the control node (leader).

###### Definition 5 (Admissible Coordination Control Policies)

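[20] Controls $u_i$ ($i\in I$) are defined as admissible coordination control policies if the coordination controls $u_i$ ($i\in I$) not only stabilize the systems (6) on $\Omega$ locally, but also make the local cost functionals (7) finite.

Under the given admissible coordination control policies $u_i$ and $u_{(j)}$, the local value function for node $i$ is defined by

$$
\begin{aligned}
V_i(e_i(t)) &= \int_t^\infty r_i(e_i,u_i,u_{(j)})\,d\tau \\
&= \int_t^\infty \Big(e_i^TQ_{ii}e_i+u_i^TR_{ii}u_i+\sum_{j\in N_i}u_j^TR_{ij}u_j\Big)d\tau,
\end{aligned}
\tag{8}
$$

and the local coupled nonlinear Lyapunov equations for (6) are

$$
\begin{aligned}
0 &= H_i(e_i,V_{e_i},u_i,u_{(j)}) \\
&\equiv r_i(e_i,u_i,u_{(j)})+V_{e_i}^T((L_i+B_i)\otimes I_n)(f_e(t)+g(x)u) \\
&= e_i^TQ_{ii}e_i+u_i^TR_{ii}u_i+\sum_{j\in N_i}u_j^TR_{ij}u_j+V_{e_i}^T\mathcal{L}_i(f_e(t)+g(x)u),
\end{aligned}
\tag{9}
$$

with $\mathcal{L}_i=(L_i+B_i)\otimes I_n$. $V_{e_i}=\partial V_i/\partial e_i$ is the partial derivative of the value function $V_i(e_i)$ with respect to $e_i$.

Meanwhile, the local coupled Hamiltonians of Problem 1 are defined by

$$
\begin{aligned}
H_i(e_i,V_{e_i},u_i,u_{(j)}) &= r_i(e_i,u_i,u_{(j)})+V_{e_i}^T\mathcal{L}_i(f_e(t)+g(x)u) \\
&= e_i^TQ_{ii}e_i+u_i^TR_{ii}u_i+\sum_{j\in N_i}u_j^TR_{ij}u_j+V_{e_i}^T\mathcal{L}_i(f_e(t)+g(x)u).
\end{aligned}
\tag{10}
$$

According to the necessary condition for optimality, $\partial H_i/\partial u_i=0$, we obtain

$$
\begin{aligned}
u_i &= -\frac{1}{2}R_{ii}^{-1}\Big(\frac{\partial u^T}{\partial u_i}\Big)g^T(x)\mathcal{L}_i^TV_{e_i} \\
&= -\frac{1}{2}R_{ii}^{-1}g_i^T(x_i)((l_{ii}+b_{ii})\otimes I_n)^TV_{e_i}.
\end{aligned}
\tag{11}
$$

Assume that the local optimal value functions $V_i^*(e_i)$ satisfy the coupled HJ equations

$$
\min_{u_i}H_i(e_i,V^*_{e_i},u_i,u_{(j)})=0; \tag{12}
$$

then the local optimal coordination controls are

$$
u_i^*=-\frac{1}{2}R_{ii}^{-1}g_i^T(x_i)((l_{ii}+b_{ii})\otimes I_n)^TV^*_{e_i}. \tag{13}
$$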
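
The control law (13) is a closed-form map from the value gradient to the control input. The sketch below evaluates it under two simplifying assumptions that are not in the paper: $R_{ii}$ is diagonal (so its inverse is element-wise), and $(l_{ii}+b_{ii})\otimes I_n$ is applied as the scalar $l_{ii}+b_{ii}$. The function name and the numeric values are illustrative.

```python
def coordination_control(grad_v, g, r_diag, l_ii, b_ii):
    """u_i = -1/2 R_ii^{-1} g_i^T ((l_ii + b_ii) I_n)^T V_ei for diagonal R_ii.

    grad_v : gradient V_ei of the local value function (length n)
    g      : input matrix g_i(x_i) as n rows of m columns
    r_diag : diagonal entries of R_ii (length m, all positive)
    """
    n, m = len(grad_v), len(r_diag)
    scale = l_ii + b_ii  # (l_ii + b_ii) kron I_n acts as a scalar multiple
    u = []
    for c in range(m):
        gt_v = sum(g[row][c] * grad_v[row] for row in range(n))
        u.append(-0.5 * scale * gt_v / r_diag[c])
    return u


# Hypothetical node with n = 2 states and m = 1 input.
u = coordination_control(grad_v=[2.0, 4.0], g=[[1.0], [0.5]],
                         r_diag=[2.0], l_ii=1.0, b_ii=1.0)
```

Note that only the node's own gradient $V_{e_i}$ and its own diagonal entry $l_{ii}+b_{ii}$ enter the control, so each agent can evaluate it from local information.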

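Inserting $u_i^*$ and $V_i^*$ into (9), we can obtain

$$
\begin{aligned}
0 &= e_i^TQ_{ii}e_i+u_i^{*T}R_{ii}u_i^*+\sum_{j\in N_i}u_j^{*T}R_{ij}u_j^*+V_{e_i}^{*T}\mathcal{L}_i(f_e(t)+g(x)u^*) \\
&= e_i^TQ_{ii}e_i+\frac{1}{4}V_{e_i}^{*T}((l_{ii}+b_{ii})\otimes I_n)g_i(x_i)R_{ii}^{-1}g_i^T(x_i)((l_{ii}+b_{ii})\otimes I_n)^TV^*_{e_i} \\
&\quad+\frac{1}{4}\sum_{j\in N_i}V_{e_j}^{*T}((l_{jj}+b_{jj})\otimes I_n)g_j(x_j)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^T(x_j)((l_{jj}+b_{jj})\otimes I_n)^TV^*_{e_j} \\
&\quad+V_{e_i}^{*T}\mathcal{L}_i(f_e(t)+g(x)u^*).
\end{aligned}
$$

We can rewrite it as the coupled HJ equations (see Appendix A)

$$
\begin{aligned}
0 &= e_i^TQ_{ii}e_i+\frac{1}{4}V_{e_i}^{*T}((l_{ii}+b_{ii})\otimes I_n)g_i(x_i)R_{ii}^{-1}g_i^T(x_i)((l_{ii}+b_{ii})\otimes I_n)^TV^*_{e_i} \\
&\quad+\frac{1}{4}\sum_{j\in N_i}\Big(V_{e_j}^{*T}((l_{jj}+b_{jj})\otimes I_n)g_j(x_j)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^T(x_j)((l_{jj}+b_{jj})\otimes I_n)^TV^*_{e_j}\Big) \\
&\quad+V_{e_i}^{*T}((l_{ii}+b_{ii})\otimes I_n)(f_{e_i}(t)+g_i(x_i)u_i^*)+V_{e_i}^{*T}\sum_{j\in N_i}((l_{ij}+b_{ij})\otimes I_n)(f_{e_j}(t)+g_j(x_j)u_j^*).
\end{aligned}
\tag{14}
$$

Inserting (13) into (14), the own-control terms combine as $\frac{1}{4}-\frac{1}{2}=-\frac{1}{4}$, and we get

$$
\begin{aligned}
0 &= e_i^TQ_{ii}e_i-\frac{1}{4}V_{e_i}^{*T}((l_{ii}+b_{ii})\otimes I_n)g_i(x_i)R_{ii}^{-1}g_i^T(x_i)((l_{ii}+b_{ii})\otimes I_n)^TV^*_{e_i} \\
&\quad+\frac{1}{4}\sum_{j\in N_i}V_{e_j}^{*T}((l_{jj}+b_{jj})\otimes I_n)g_j(x_j)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^T(x_j)((l_{jj}+b_{jj})\otimes I_n)^TV^*_{e_j} \\
&\quad+V_{e_i}^{*T}\sum_{j\in\{N_i,i\}}((l_{ij}+b_{ij})\otimes I_n)f_{e_j}(t) \\
&\quad-\frac{1}{2}V_{e_i}^{*T}\sum_{j\in N_i}((l_{ij}+b_{ij})\otimes I_n)g_j(x_j)R_{jj}^{-1}g_j^T(x_j)((l_{jj}+b_{jj})\otimes I_n)^TV^*_{e_j}.
\end{aligned}
\tag{15}
$$

Note that the optimal value functions $V_i^*(e_i)$ ($i\in I$) are the solution of equations (15). The optimal coordination controls (13) can then be obtained from $V_i^*(e_i)$. In fact, the solution of equations (15) is a Nash equilibrium. Their relationship is introduced in the next subsection.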

### IV-B Nash Equilibrium

First, according to [17], we introduce the Nash equilibrium definition for multi-player games.

###### Definition 6 (Global Nash Equilibrium)

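An $N$-tuple of control policies $\{u_1^*,u_2^*,\ldots,u_N^*\}$ is referred to as a global Nash equilibrium solution for an $N$-player game (graph $G$) if for all $i\in I$

$$
J_i^*\triangleq J_i(u_1^*,u_2^*,\ldots,u_i^*,\ldots,u_N^*)\le J_i(u_1^*,u_2^*,\ldots,u_i,\ldots,u_N^*),\quad (u_i\neq u_i^*).
$$

The $N$-tuple of the local performance values $\{J_1^*,J_2^*,\ldots,J_N^*\}$ is known as a Nash equilibrium of the $N$-player game (graph $G$).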

Then, two important facts are obtained by Theorem 1 below, that is, the conclusions (I) and (II).

###### Theorem 1

Let $V_i^*(e_i)$, $i\in I$, be a solution to the coupled HJ equations (15), and let the optimal coordination control policies $u_i^*$ ($i\in I$) be given by (13) in terms of these solutions $V_i^*(e_i)$. Then

(I) The local neighborhood consensus error systems (6) are asymptotically stable.

(II) The local performance values $J_i^*(e_i(0),u_i^*,u_{(j)}^*)$ are equal to $V_i^*(e_i(0))$, $i\in I$; and $u_i^*$ and $u_{(j)}^*$ are in Nash equilibrium.

###### Proof

First, conclusion (I) is proven. Under the stated conditions, the local optimal value functions $V_i^*(e_i)$ satisfy (15), and hence they also satisfy (9). Take the time derivative of $V_i^*(e_i)$:

$$
\begin{aligned}
\dot{V}_i^*(e_i) &= V_{e_i}^{*T}\dot{e}_i \\
&= V_{e_i}^{*T}\mathcal{L}_i(f_e(t)+g(x)u^*) \\
&= -e_i^TQ_{ii}e_i-u_i^{*T}R_{ii}u_i^*-\sum_{j\in N_i}u_j^{*T}R_{ij}u_j^*.
\end{aligned}
$$

Since $Q_{ii}>0$, $R_{ii}>0$, and $R_{ij}\ge 0$, we have $\dot{V}_i^*(e_i)<0$ for $e_i\neq 0$. Therefore, $V_i^*(e_i)$ is a Lyapunov function for $e_i$, and the local neighborhood consensus error system (6) is asymptotically stable.

Conclusion (II) follows directly from the definitions of the performance index and the value function together with Definition 6.

###### Remark 5

In Theorem 1, part (II) states that the solution of the equation set (15) is a Nash equilibrium. Note that the solution of (15) is not unique; in general, there exist multiple Nash equilibria. In fact, in the ADP field, the obtained optimal solution is a local optimum [46]. The globally optimal solution cannot be obtained unless we explore the entire state space, which is generally not possible.

If the coupled HJ equations (15) can be solved, we obtain a Nash equilibrium for the multi-agent system. However, due to the nonlinear nature of the coupled HJ equations (15), obtaining an analytical solution is generally difficult. Therefore, in the next subsection, the policy iteration algorithm is used to solve the coupled HJ equations.

### IV-C Policy Iteration (PI) Algorithm for the Coupled HJ Equations

In general, equations (15) are difficult or impossible to solve analytically. In the field of ADP and reinforcement learning, the PI algorithm is commonly used to obtain the solution of the HJB equation. Similarly, we solve the coupled HJ equations by the PI algorithm, which relies on repeated policy evaluation (i.e., the solution of (9)) and policy improvement (the solution of (11)). The iteration is carried out until the result of policy improvement no longer changes. If the controls of all the nodes $u_i$ ($i\in I$) do not change under the framework of the PI algorithm, then they are the solution (Nash equilibrium) of the coupled HJ equations (12) or (15). However, the initial local coordination control policies must be admissible control policies in the PI algorithm.

###### Step 1

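(Policy Evaluation) Given the $N$-tuple of policies $u_i^k$, solve for the $N$-tuple of costs $V_i^k$ using (9):

$$
0=H_i(e_i,V^k_{e_i},u^k_i,u^k_{(j)}),\quad\forall i=1,\ldots,N. \tag{16}
$$

###### Step 2

(Policy Improvement) Update the $N$-tuple of control policies using (11):

$$
u_i^{k+1}=-\frac{1}{2}R_{ii}^{-1}g_i^T(x_i)((l_{ii}+b_{ii})\otimes I_n)^TV^k_{e_i},\quad\forall i=1,\ldots,N.
$$

Go to Step 1.

The iteration stops only when the updated policies no longer change, i.e., when $u_i^{k+1}$ coincides with $u_i^k$ for all $i\in I$.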
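
The two steps can be traced by hand in the simplest special case, which is an assumption made here for illustration and not the paper's example: a single pinned agent with no neighbors, scalar linear drift $f(x)=ax$, constant input gain $g$, and pinning gain $b$. Then $e=b(x-x_0)$ obeys $\dot{e}=ae+bgu$, and with $V(e)=pe^2$ and $u=-\kappa e$ policy evaluation reduces to a scalar Lyapunov equation and policy improvement to $\kappa=gbp/r$. All names below are hypothetical.

```python
import math


def policy_iteration(a, g, b, q, r, k0, iters=50, tol=1e-12):
    """Scalar PI for e' = a e + b g u, cost q e^2 + r u^2, policy u = -k e."""
    k = k0
    p = None
    for _ in range(iters):
        closed_loop = a - b * g * k
        if closed_loop >= 0:
            raise ValueError("policy is not admissible (closed loop unstable)")
        # Step 1 (policy evaluation): 0 = q + r k^2 + 2 p (a - b g k).
        p = (q + r * k * k) / (-2.0 * closed_loop)
        # Step 2 (policy improvement): u = -1/2 r^{-1} g b V_e with V_e = 2 p e.
        k_next = g * b * p / r
        if abs(k_next - k) < tol:
            k = k_next
            break
        k = k_next
    return p, k


def riccati_solution(a, g, b, q, r):
    """Positive root of 0 = q + 2 a p - (g b)^2 p^2 / r."""
    s = (g * b) ** 2 / r
    return (a + math.sqrt(a * a + s * q)) / s


p_pi, k_pi = policy_iteration(a=0.5, g=1.0, b=1.0, q=1.0, r=1.0, k0=2.0)
p_star = riccati_solution(a=0.5, g=1.0, b=1.0, q=1.0, r=1.0)
```

In this scalar case the iteration is the classical Newton-type recursion for the Riccati equation, so the value weight approaches the positive Riccati root from an admissible initial gain. The admissibility check mirrors the requirement that the initial policies be stabilizing.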

Next, inspired by the linear result in [20], we give a theorem on the convergence of the policy iteration algorithm for the nonlinear case.

###### Theorem 2

(Convergence of Policy Iteration Algorithm). Assume policies of all nodes are updated at each iteration in PI algorithm. Then for small and big , converges to the Nash equilibrium and for all , and the value functions converge to the optimal value functions .

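###### Proof

By the following facts,

$$
\begin{aligned}
&H_i(e_i,V^{k+1}_{e_i},u^{k+1}_i,u^{k+1}_{(j)})-H_i(e_i,V^k_{e_i},u^k_i,u^k_{(j)}) \\
&\quad=\sum_{j\in\{N_i,i\}}(u^{k+1}_j-u^k_j)^TR_{ij}(u^{k+1}_j-u^k_j)+2\sum_{j\in\{N_i,i\}}(u^k_j)^TR_{ij}(u^{k+1}_j-u^k_j)+\Theta_i,
\end{aligned}
$$

where $\Theta_i=(V^{k+1}_{e_i})^T\mathcal{L}_i(f_e(t)+g(x)u^{k+1})-(V^k_{e_i})^T\mathcal{L}_i(f_e(t)+g(x)u^k)$, and

$$
H_i(e_i,V^{k+1}_{e_i},u^{k+1}_i,u^{k+1}_{(j)})-H_i(e_i,V^k_{e_i},u^k_i,u^k_{(j)})=r_i(e_i,u^{k+1}_i,u^{k+1}_{(j)})-r_i(e_i,u^k_i,u^k_{(j)})+\Theta_i,
$$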

we can obtain

 ri(ei,uk+1i,uk+1(j))−ri(ei,uki,uk(j))=