Leader-Based Optimal Coordination Control for the Consensus Problem of Multiagent Differential Games via Fuzzy Adaptive Dynamic Programming
Abstract
In this paper, a new online scheme is presented to design the optimal coordination control for the consensus problem of multiagent differential games by fuzzy adaptive dynamic programming (FADP), which brings together game theory, the generalized fuzzy hyperbolic model (GFHM), and adaptive dynamic programming (ADP). In general, the optimal coordination control for multiagent differential games is the solution of the coupled Hamilton-Jacobi (HJ) equations. Here, for the first time, GFHMs are used to approximate the solutions (value functions) of the coupled HJ equations, based on the policy iteration (PI) algorithm. Namely, for each agent, a GFHM is used to capture the mapping between the local consensus error and the local value function. Since our scheme uses a single-network architecture for each agent (which eliminates the action network model required by the dual-network architecture), it is a more reasonable architecture for multiagent systems. Furthermore, the approximate solution is used to obtain the optimal coordination control. Finally, we give the stability analysis for our scheme and prove that the weight estimation error and the local consensus error are uniformly ultimately bounded (UUB). Further, the control node trajectory is proven to be cooperative uniformly ultimately bounded (CUUB).
I Introduction
In recent decades, the consensus problems of multiagent systems (for instance, formation control [1], flocking [2, 3], rendezvous [4], and sensor networks [5, 6]) have received considerable attention; see, for example, [7, 8] and [9]. Consensus problems originated in computer science, where they formed the foundation of the field of distributed computing [10]. Subsequently, these problems were extended to management science and statistics [11]. References [12] and [13], from the 1980s, are regarded as the pioneering work on consensus problems in control theory. In [14], Olfati-Saber and Murray presented the fundamental framework for solving consensus problems for multiagent systems. The overviews [15] and [16] summarize the recent achievements of coordination control for consensus problems of multiagent systems.
In [15], Ren et al. posed an open research problem for consensus problems: how to design an optimal coordination control that not only makes multiagent systems stable, but also minimizes their performance indexes. In a physical sense, the optimal coordination control makes every agent use the least amount of energy while the agents reach a consensus. In fact, every agent depends on its own actions and those of all its neighboring agents. Therefore, every agent needs to choose a control that minimizes its own performance index by acting on itself, according to the outcomes of its neighboring agents. This is similar to a multiplayer cooperative game.
Game theory [17] studies strategic decision-making problems. More formally, it is “the study of mathematical models of conflict and cooperation between intelligent rational decision-makers.” In cooperative games, communication among players is generally allowed, and the decision of each player depends on his own actions and those of all the other players. In the early days, game theory was widely used for solving multiplayer game problems, such as in [18] and [19]. Recently, game theory has also become the theoretical basis for multiagent games [20, 21, 22]. In a differential game, the evolution of the agents' state variables is governed by differential equations. The problem of finding an optimal strategy in a differential game is closely related to optimal control theory. In particular, closed-loop strategies can be found by Bellman's dynamic programming method, as in [18, 19, 20]. For multiagent systems, since every agent's action depends on the outcomes of itself and all its neighboring agents, coupled Hamilton-Jacobi (HJ) equations are set up. Therefore, for multiagent differential games, the optimal coordination control relies on solving the coupled HJ equations. In general, however, these equations are very difficult to solve.
Therefore, in this paper, the ADP algorithm ([23] and [24]), which combines adaptive control and reinforcement learning, is introduced to learn the solution of the HJ equations online for multiagent systems. Excellent overviews of the state-of-the-art developments in ADP are given in [25, 26, 27, 28]. How to approximate the value function is a key problem in the ADP algorithm. Based on the Weierstrass higher-order approximation theorem [29], a complete set of basis functions can approximate the solution of the Hamilton-Jacobi-Bellman (HJB) equation by a linear expression as the number of basis functions tends to infinity. For a finite number of basis functions, however, the approximation is sensitive to the chosen basis. If a smooth function cannot be spanned by a finite set of independent basis functions, then that set cannot approximate the function exactly. Therefore, we want to choose a set of independent basis functions that captures the significant features of the value function as well as possible. Traditionally, neural networks are used as the approximator. However, neural networks lack clear physical significance, and their activation functions (basis functions) are chosen manually, so we do not know whether the selected activation functions are appropriate. This motivates us to circumvent the disadvantage by using fuzzy approximation technology (a fuzzy approximator). Fuzzy approximation can characterize the value function more reasonably by using knowledge from human experts and experiments. The generalized fuzzy hyperbolic model (GFHM) is a better choice of function approximator [30, 31, 32]: it has clear physical significance [35] (it is easy to construct a GFHM if some linguistic information about the relationship between the function output and the input variables is known), and the model weights can be optimized by adaptive learning.
Specifically, the GFHM transforms the problem of how to choose basis functions in a neural network model into the problem of how to translate the input variables. In this way, the entire input space can be covered as much as possible by choosing sufficient and proper generalized input variables. The GFHM is therefore a better approximator for estimating the value function, as in [33] and [34].
In recent years, some optimal control methods have been proposed for the multiagent consensus problem, such as the linear quadratic regulator (LQR) technique [36] and the model predictive control (MPC) technique [37]. However, the method in [36] is limited to linear systems and is an offline design procedure. Although the method in [37] yields a good online controller for single- and double-integrator multiagent systems (in particular, with a time-varying communication network), it requires continuous sampling and real-time prediction, and it produces a control sequence only for a finite horizon. Moreover, [37] addresses leaderless discrete-time agents. Here, we deal with the continuous-time nonlinear consensus problem with a leader, online, using the ADP algorithm. The algorithm solves the coupled HJ equations directly by policy iteration and adaptive control methods, while avoiding the sampling and repeated prediction required in [37]. In addition, we obtain an optimal control law for the infinite horizon once the ADP algorithm no longer changes the adjustable control weights.
In this paper, our major idea is to use game theory to solve the optimal coordination control problem for multiagent systems based on adaptive dynamic programming. By Bellman's dynamic programming method, we construct the coupled HJ equations for multiagent differential games. To obtain the solution of the coupled HJ equations, GFHMs are used to approximate the value functions (solutions) under the framework of the PI algorithm [38]. This approximation introduces residual errors in the coupled HJ equations. To minimize these errors, gradient descent is used to update the weights of the GFHM approximators. The weights are updated continuously until they no longer change. We call this scheme fuzzy adaptive dynamic programming (FADP). Finally, we analyze the stability conditions and prove that the weight error and the local consensus error are uniformly ultimately bounded (UUB).
The contributions of the paper include:

The cooperative problem of multiplayer games is extended to the coordination consensus control problem of nonlinear multiagent systems. The paper builds a relationship between the optimal consensus problem for multiagent systems and the Nash equilibrium of cooperative game theory.

The coupled Hamilton-Jacobi equations for multiagent systems are established by Bellman's dynamic programming, and then the stability analysis is developed for our scheme.

The open problem, i.e., the optimal consensus problem for multiagent systems presented in [15], is solved by fuzzy adaptive dynamic programming with a single-network architecture for the first time. Namely, only one GFHM is used to approximate the local value function for each agent.
The rest of this paper is organized as follows. In Section II, some definitions and notions are given. The local consensus error dynamic system is established in Section III. In Section IV, the coupled Hamilton-Jacobi equations for multiagent systems are derived, the stability of the Nash equilibrium is proven, and the coupled HJ equations are solved by the PI algorithm. Section V derives the approximate coupled HJ equations by using GFHMs. Section VI gives the stability analysis for our scheme and proves that the weight estimation error and the local consensus error are UUB, and that the control node trajectory is CUUB. Finally, a numerical example is given to illustrate the effectiveness of our scheme.
II Preliminaries
The purpose of this section is to provide the foundations of graph theory, information consensus and generalized fuzzy hyperbolic model.
II-A Graph Theory
In this paper, graph theory is used as a mathematical tool to analyze multiagent systems. Whether the information flow is unidirectional or bidirectional, the topology of a communication network can be expressed by a weighted graph.
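Let $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A})$ be a weighted graph of $N$ nodes with the nonempty finite set of nodes $\mathcal{V} = \{v_1, \ldots, v_N\}$, where the set of edges $\mathcal{E}$ belongs to the product space of $\mathcal{V}$ (i.e., $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$), an edge of $\mathcal{G}$ is denoted by $(v_j, v_i)$, which is a direct path from node $j$ to node $i$, and $\mathcal{A} = [a_{ij}]$ is a weighted adjacency matrix with nonnegative adjacency elements, i.e., $a_{ij} > 0$ if $(v_j, v_i) \in \mathcal{E}$, otherwise $a_{ij} = 0$. The node index $i$ belongs to a finite index set $\mathcal{I} = \{1, \ldots, N\}$.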
Definition 1 (Laplacian Matrix)
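The graph Laplacian matrix is defined as $L = D - \mathcal{A}$, where $\mathcal{A} = [a_{ij}]$ is the weighted adjacency matrix and $D = \mathrm{diag}(d_1, \ldots, d_N)$ is the in-degree matrix of the graph, with $d_i = \sum_{j} a_{ij}$ the in-degree of node $i$.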
Remark 1
The Laplacian matrix has all row sums equal to zero.
In this paper, we assume the graph is simple, i.e., it has no repeated edges and no self-loops. The set of neighbors of node $i$ is denoted by $N_i$. A digraph is said to have a spanning tree if there is a node (called the root) such that there is a directed path from the root to every other node in the graph. A digraph is said to be strongly connected if there is a directed path from node $i$ to node $j$ for all distinct nodes $i$ and $j$. A digraph has a spanning tree if it is strongly connected, but not vice versa.
Here, we focus on the strongly connected communication digraph with fixed topology.
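The graph notions above can be made concrete with a short Python sketch. It assumes the common convention that the adjacency entry in row i, column j is positive when there is an edge from node j to node i; the function names and the three-node directed cycle are hypothetical and chosen only for illustration.

```python
def laplacian(adjacency):
    """Return the Laplacian (in-degree matrix minus adjacency matrix)."""
    n = len(adjacency)
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        in_degree = sum(adjacency[i])  # row i collects the edges entering node i
        for j in range(n):
            lap[i][j] = (in_degree if i == j else 0.0) - adjacency[i][j]
    return lap


def is_strongly_connected(adjacency):
    """Check that every node reaches, and is reached from, node 0."""
    n = len(adjacency)

    def reachable(start, forward):
        seen = {start}
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                # forward: follow edges u -> v; backward: follow edges v -> u
                weight = adjacency[v][u] if forward else adjacency[u][v]
                if weight > 0 and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    return len(reachable(0, True)) == n and len(reachable(0, False)) == n


# Hypothetical example: directed cycle 0 -> 1 -> 2 -> 0 with unit weights.
A = [[0, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
L = laplacian(A)
row_sums = [sum(row) for row in L]
```

For this cycle every row of the Laplacian sums to zero, as Remark 1 states, and the strong-connectivity check succeeds.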
II-B Consensus for Networks of Agents
A multiagent system is a network that consists of a group of agents. Every agent is called a node in the network. Let $x_i \in \mathbb{R}^n$ denote the state of node $i$. We call the graph together with the state $x = (x_1^T, \ldots, x_N^T)^T$ a network (or algebraic graph). The state of a node might represent a physical quantity of the agent, such as altitude, velocity, angle, or voltage. We say the nodes of a network have reached a consensus if and only if $x_i = x_j$ for all $i, j$. For the consensus problem with a leader, every node is required to satisfy $x_i(t) \to x_0(t)$ for all $i$, where $x_0(t)$ is the state trajectory of the leader.
II-C Generalized Fuzzy Hyperbolic Model
Definition 2
Given a plant with input variables and an output variable . We call the fuzzy rule base the generalized fuzzy hyperbolic rule base if it satisfies the following conditions:

The fuzzy rule takes the following form :
where, represents the number of transformations associated with each , and are constants that define the transformations, are fuzzy sets of which include subsets (positive) and (negative), and are constants corresponding to .

The constants in the THENpart correspond to in the IFpart. That is, if there is in the IFpart, must appear in the THENpart; otherwise, does not appear in the THENpart.

There are fuzzy rules in the rule base, where that is, all the possible and combinations of input variables in the IFpart and all the linear combinations of constants in the THENpart.
Lemma 1
[30, 32] For a multiple input single output system, , define the generalized input variables as
and the generalized fuzzy hyperbolic rule base as in Definition 2, respectively, where the membership functions of the generalized input variables and are defined as
where .
We can then derive the following model:
where is an ideal vector; with () and ; and is a constant scalar. We call it as generalized fuzzy hyperbolic model (GFHM).
Lemma 2
Remark 2
Lemma 2 shows that the GFHM can uniformly approximate any nonlinear function on a compact set to any degree of accuracy; that is, the GFHM is a universal approximator (see [30] for details). Therefore, the GFHM can approximate a function with a bounded error, given sufficient and proper generalized input variables that cover the entire input space as much as possible. The translation quantities of the input variables need to be chosen by expertise or manual selection.
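To illustrate how a GFHM can be evaluated, here is a minimal Python sketch. It assumes the model output is a weighted sum of hyperbolic tangents of scaled generalized inputs plus a constant, $y = \theta^T \tanh(\Theta \bar{x}) + \zeta$ with diagonal $\Theta$, and that each generalized input is a shifted copy of an original input. The function names, the shift values, and the weights are all hypothetical.

```python
import math


def generalized_inputs(x, shifts):
    """Build the generalized inputs as shifted copies x_z - d of each input x_z."""
    return [x_z - d for x_z, d_list in zip(x, shifts) for d in d_list]


def gfhm(x, shifts, gains, weights, bias=0.0):
    """Evaluate sum_i weights[i] * tanh(gains[i] * xbar[i]) + bias."""
    xbar = generalized_inputs(x, shifts)
    return sum(w * math.tanh(k * xb)
               for w, k, xb in zip(weights, gains, xbar)) + bias


# Hypothetical example: one input, three translations.
shifts = [[-1.0, 0.0, 1.0]]
gains = [1.0, 1.0, 1.0]
weights = [1.0, 2.0, 1.0]
y_origin = gfhm([0.0], shifts, gains, weights)
```

With symmetric shifts and weights as in this example, the two outer terms cancel at the origin and the output is zero. In an adaptive scheme only the weights would be tuned online, while the shifts are fixed in advance.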
III Consensus Error Dynamic System
Consider multiagent systems with agents in the form of communication network . Their node dynamics are
(1) 
where is the state of node , is the input coordination control. and , such that and contains the origin (, is the Euclidean norm).
The global network dynamics is
(2) 
where the global state vector of the multiagent system (2) is , the global nodedynamics vector is , with and the global control input (). is the number of the nodes.
The state of the control node (or leader) is which satisfies the dynamics
(3) 
where , is the differentiable function.
The local neighborhood consensus error for node is defined as
(4) 
where (). is the pinning gain (). Note that for at least one . Then if and only if there is not a direct path from the control node to the node in ; otherwise . The nodes () are referred to as the pinned or controlled nodes.
Remark 3
The local neighborhood consensus error indicates whether a node agrees with the leader and with its neighbors. The multiagent system reaches a consensus as every local neighborhood consensus error converges to zero.
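As an illustration, the following Python sketch computes a local neighborhood consensus error under the assumption that it takes the form common in the leader-following consensus literature, $\sum_{j} a_{ij}(x_i - x_j) + b_i (x_i - x_0)$, where $a_{ij}$ are adjacency weights, $b_i$ is the pinning gain, and $x_0$ is the leader state. The function name, the symbols, and the numerical values are assumptions made for this example.

```python
def local_consensus_error(i, states, leader_state, adjacency, pinning):
    """Return sum_j a_ij (x_i - x_j) + b_i (x_i - x_0) for node i."""
    x_i = states[i]
    error = [0.0] * len(x_i)
    for j, x_j in enumerate(states):
        a_ij = adjacency[i][j]  # zero when node j is not a neighbor of node i
        for k in range(len(x_i)):
            error[k] += a_ij * (x_i[k] - x_j[k])
    for k in range(len(x_i)):
        error[k] += pinning[i] * (x_i[k] - leader_state[k])
    return error


# Hypothetical example: directed cycle 0 -> 1 -> 2 -> 0, only node 0 is pinned.
adjacency = [[0, 0, 1],
             [1, 0, 0],
             [0, 1, 0]]
pinning = [1.0, 0.0, 0.0]
leader = [1.0]
states = [[1.0], [2.0], [4.0]]
errors = [local_consensus_error(i, states, leader, adjacency, pinning)
          for i in range(3)]
agreed = [local_consensus_error(i, [[1.0]] * 3, leader, adjacency, pinning)
          for i in range(3)]
```

When all nodes share the leader state, every local error is zero, which matches the interpretation given in Remark 3.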
The global error vector for the network is
(5) 
with ( is an identity matrix with dimensions), where is the Laplacian matrix for the network ; and , with and is the Nvector of ones; is a diagonal matrix with diagonal entries (i.e. ). is the Kronecker product operator. Differentiating (4) or (III), the dynamics of local neighborhood consensus error for network are given by
(6) 
where with , and . is denoted as a row vector which is the row vector of the Laplacian matrix , that is, . Similarly, .
Remark 4
Since is zero when the node is not the neighbor of node , the expressions (III) only contain control inputs of all the neighbors of node and itself in network . In fact, it denotes that the local neighborhood consensus error depends on the states and the control inputs from node and all of its neighbors.
Definition 3
(Uniformly Ultimately Bounded (UUB)) The local neighborhood consensus error is uniformly ultimately bounded (UUB) if there exists a compact set so that there exists a bound and a time , both independent of , such that .
Definition 4
IV Optimal Coordination Control
To reach a consensus while simultaneously minimizing the local performance index of every agent, we use the machinery of $N$-person cooperative games ([19, 20]) to design the optimal coordination control for the systems (III).
IV-A The Coupled HJ Equation
Define the local performance indexes (cost functionals) by
(7) 
with . are the control input vectors of the neighbors of node .
All weighting matrices are constant and satisfy , and . Note that if is the control inputs of the neighbors of node , then , vice versa. Otherwise, . In other words, the performance index depends on the input information of node and its neighbors.
Problem 1
Definition 5 (Admissible Coordination Control Policies)
Under the given admissible coordination control policies and , the local value function for node is defined by
and the local coupled nonlinear Lyapunov equations for (III) are
(9) 
with . is the partial derivative of the value function with respect to .
Meanwhile, the local coupled Hamiltonians of Problem 1 are defined by
(10) 
According to the necessary condition for optimality, we can obtain
(11) 
Assume that the local optimal value functions satisfy the coupled HJ equations
(12) 
then, the local optimal coordination controls are
(13) 
Inserting and to (IVA), we can obtain
We can rewrite it as the coupled HJ equations (see Appendix A)
(14) 
(15) 
IV-B Nash Equilibrium
First, according to [17], we introduce the Nash equilibrium definition for multiplayer games.
Definition 6 (Global Nash Equilibrium)
An $N$-tuple of control policies is referred to as a global Nash equilibrium solution for an $N$-player game (graph ) if for all
The $N$-tuple of the local performance values is known as a Nash equilibrium of the $N$-player game (graph ).
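The defining property of a Nash equilibrium is that no player can lower its own cost by deviating unilaterally. The Python sketch below checks this property on a hypothetical two-player static game with quadratic costs. This is a deliberate simplification of the differential game considered here, and the cost functions are invented for illustration only.

```python
def cost_1(u1, u2):
    """Hypothetical cost of player 1, convex in its own action u1."""
    return (u1 - 1.0) ** 2 + u1 * u2


def cost_2(u1, u2):
    """Hypothetical cost of player 2, convex in its own action u2."""
    return (u2 + 1.0) ** 2 + u1 * u2


def is_nash(u1_star, u2_star, deviations, tol=1e-9):
    """Return True if no unilateral deviation on the grid lowers a player's cost."""
    j1_star = cost_1(u1_star, u2_star)
    j2_star = cost_2(u1_star, u2_star)
    for d in deviations:
        if cost_1(u1_star + d, u2_star) < j1_star - tol:
            return False
        if cost_2(u1_star, u2_star + d) < j2_star - tol:
            return False
    return True


# Unilateral deviations on a grid from -2 to 2.
deviations = [k / 10.0 for k in range(-20, 21)]
# Setting each player's own-action gradient to zero gives u1 = 2, u2 = -2.
at_equilibrium = is_nash(2.0, -2.0, deviations)
at_origin = is_nash(0.0, 0.0, deviations)
```

At the pair (2, -2), a deviation d by either player changes that player's cost from -3 to d^2 - 3, so the check passes, whereas the origin fails because player 1 gains by moving toward 1.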
Then, two important facts are obtained by Theorem 1 below, that is, the conclusions (I) and (II).
Theorem 1
Let , be a solution to the coupled HJ equations (IVA), and let the optimal coordination control policies () be given by (13) in terms of these solutions . Then
 (I)

The local neighborhood consensus error systems (III) are asymptotically stable.
 (II)

The local performance values are equal to , ; and and are in Nash equilibrium.
First, the conclusion (I) is proven. Under the conditions, the local optimal value functions satisfy (IVA) then they also satisfy (IVA). Take the time derivative of
Since and . Therefore, is a Lyapunov function for . Furthermore, the local neighborhood consensus error system (III) is asymptotically stable.
The conclusion (II) is obvious, according to the definition of performance index, value function and Definition 6.
Remark 5
In Theorem 1, part (II) states that the solution of the equation set (IVA) is a Nash equilibrium. Note that the solution of (IVA) is not unique; in general, there exist multiple Nash equilibria. In fact, in the ADP field, the optimal solution obtained is a local optimum [46]. The globally optimal solution cannot be obtained unless the entire state space is explored, which in general is not possible.
If the coupled HJ equations (IVA) can be solved, we obtain the Nash equilibrium for the multiagent system. However, due to the nonlinear nature of the coupled HJ equations (IVA), obtaining their analytical solution is generally difficult. Therefore, in the next subsection, the policy iteration algorithm is used to solve the coupled HJ equations.
IV-C Policy Iteration (PI) Algorithm for the Coupled HJ Equations
In general, equations (IVA) are difficult or impossible to solve. In the field of ADP and reinforcement learning, the PI algorithm is usually used to obtain the solution of the HJB equation. Similarly, we solve the coupled HJ equations by the PI algorithm, which relies on repeated policy evaluation (i.e., the solution of (IVA)) and policy improvement (the solution of (IVA)). The iteration is carried out until the result of policy improvement no longer changes. If the controls of all the nodes no longer change under the framework of the PI algorithm, then they are the solution (Nash equilibrium) of the coupled HJ equations (12) or (IVA). However, the initial local coordination control policies in the PI algorithm must be admissible control policies.
Policy Iteration Algorithm: Start with admissible initial policies .
Step 1
(Policy Evaluation) Given the tuple of policies , solve for tuple of costs using (IVA)
(16) 
Step 2
(Policy Improvement) Update the tuple of control policies using (IVA)
Go to step 1.
It does not stop until converge to , for .
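The alternation between policy evaluation and policy improvement can be sketched on a much simpler problem than the coupled HJ equations. The Python example below applies policy iteration to a hypothetical single-agent scalar linear-quadratic problem, with dynamics dx/dt = a x + b u, running cost q x^2 + r u^2, value function p x^2, and policy u = -k x. It is a stand-in used only to show the two steps and the need for an admissible (stabilizing) initial policy; it is not the multiagent scheme of this paper, and all names and numbers are assumptions.

```python
import math


def policy_iteration(a, b, q, r, k0, tol=1e-12, max_iter=100):
    """Policy iteration for a scalar linear-quadratic problem with u = -k x."""
    if a - b * k0 >= 0:
        raise ValueError("initial policy is not admissible (not stabilizing)")
    k = k0
    p = None
    for _ in range(max_iter):
        # Step 1 (policy evaluation): solve 2 (a - b k) p + q + r k^2 = 0 for p.
        p = (q + r * k * k) / (2.0 * (b * k - a))
        # Step 2 (policy improvement): minimize the Hamiltonian over u.
        k_new = b * p / r
        if abs(k_new - k) < tol:
            return p, k_new
        k = k_new
    return p, k


# Hypothetical unstable plant a = 1 with a stabilizing initial gain k0 = 3.
p_star, k_star = policy_iteration(a=1.0, b=1.0, q=1.0, r=1.0, k0=3.0)
# Closed-form solution of the scalar Riccati equation 2 a p - b^2 p^2 / r + q = 0.
p_riccati = 1.0 + math.sqrt(2.0)
```

In this example the iterates approach the Riccati solution from above within a few steps. Starting from a gain that does not stabilize the plant makes the evaluation step meaningless, which is why the sketch rejects such a gain up front.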
Next, inspired by the linear result in [20], we give a theorem stating the convergence of the policy iteration algorithm for the nonlinear case.
Theorem 2
(Convergence of Policy Iteration Algorithm). Assume policies of all nodes are updated at each iteration in PI algorithm. Then for small and big , converges to the Nash equilibrium and for all , and the value functions converge to the optimal value functions .
By the following facts,
where , and
we can obtain