\MyTitleFontCooperative Path Integral Control for Stochastic Multi-Agent Systems

# \MyTitleFontCooperative Path Integral Control for Stochastic Multi-Agent Systems

## Abstract

\MyAbstractFont

A distributed stochastic optimal control solution is presented for cooperative multi-agent systems. The network of agents is partitioned into multiple factorial subsystems, each of which consists of a central agent and neighboring agents. Local control actions that rely only on agents’ local observations are designed to optimize the joint cost functions of subsystems. When solving for the local control actions, the joint optimality equation for each subsystem is cast as a linear partial differential equation and solved using the Feynman-Kac formula. The solution and the optimal control action are then formulated as path integrals and approximated by a Monte-Carlo method. Numerical verification is provided through a simulation example consisting of a team of cooperative UAVs.

\paperwidth\paperheight1

## 1 Introduction

The research on control and planning in multi-agent systems (MASs) has been developing rapidly during the last decade with increasing demand from areas, such as cooperative vehicles [1], Internet of Things [2] and intelligent infrastructures [3]. Distinct from other control problems, control of MASs is characterized by challenges of limited information and resources of local agents, randomness of agent dynamics and communication networks, and optimality and robustness of joint performance. A good summary of recent progress in multi-agent control can be found in [4, 5, 6, 7, 8]. Building upon these results, this paper proposes a cooperative optimal control scheme by extending the path integral control (PIC) algorithm in [9] for a general type of stochastic MAS subject to limited feedback information and computational resource.

Path integral control is a model-based stochastic optimal control (SOC) algorithm that linearizes and solves the stochastic Hamilton-Jacobi-Bellman (HJB) equation with the facilitation of Cole-Hopf transformation, i.e. exponential transformation of value function [10, 11]. Compared with other continuous-time SOC techniques, since PIC formulates the optimality equations in linear form, it enjoys the superiority of closed-form solution [12] and superposition principle [13], which makes PIC a popular control scheme for robotics [14]. Some recent developments of single-agent PIC algorithms can be found in [9, 15, 12, 16].

Different from many prevailing distributed control algorithms [5, 6, 7, 8], such as consensus and synchronization that usually assume a given behavior, multi-agent SOC allows agents to have different objectives and optimizes the action choices for more general scenarios [4]. Nevertheless, it is not straightforward to extend the single-agent SOC or PIC algorithms to MASs. The exponential growth of dimensionality in MASs and the consequent surges in computation and data storage demand more sophisticated and preferably distributed planning and execution algorithms. The involvement of communication networks (and constraints) requires the multi-agent SOC algorithms to achieve stability and optimality subject to local observation and more involved cost function. While a few efforts have been made on multi-agent PIC, most of these control algorithms still depend on the knowledge of the global state, i.e. a fully connected communication network, which may not be feasible or affordable to attain in practice. Some multi-agent PIC algorithms also assume that the joint cost function can be factorized over agents, which simplifies the multi-agent control problem into multiple single-agent problems by ignoring the correlations among agents, and some features and advantages of MASs are therefore forfeited. Broek et al. investigated the multi-agent PIC problem for continuous-time systems governed by Itô diffusion process [17]; a path integral formula was put forward to approximate the optimal control actions, and a graphical model inference approach was adopted to predict the optimal path distribution; nonetheless, the optimal control policy assumed an accurate and complete knowledge of global state, and the inference was conducted on the basis of mean field approximation, which assumes that the cost function can be disjointly factorized over agents. A distributed PIC algorithm with infinite-horizon and discounted cost was applied to solving a distance-based formation problem for nonholonomic vehicular network without explicit communication topology in [18]. Cooperative PIC problem was also recently studied in [16] as an accessory result for a novel single-agent PIC algorithm; an augmented dynamics was built by piling up the dynamics of all agents, and a single-agent PIC algorithm was then applied to the augmented system. Nonetheless, the results resorting to augmented dynamics presume fully connected network and face the challenge that the computational and sampling schemes that originated from single-agent problem may become inefficient and possibly fail as the dimensions of augmented state and control grow exponentially in the number of agents.

In order to address the aforementioned issues, cooperative PIC algorithm is investigated in this paper with consideration of local observation, joint cost function, and an efficient computational method. A distributed control framework that partitions the connected communication network into multiple factorial subsystems is proposed, and the local PIC protocol of each individual agent is computed in a subsystem, which consists of an interested central agent and its neighboring agents. Under this framework, every (central) agent relying on the local observation acts optimally to minimize a joint cost function of its subsystem, and the complexities of computation and sampling are now related to the size (amount of agents) of each factorial subsystem instead of the entire network. When solving for the local optimal control action, instead of adopting the mean-field approximation and factorizing the cost function over individual agents, joint cost functions are considered inside the subsystems. The joint optimality equation of each subsystem is first cast into a joint stochastic HJB equation and formulated as a linear partial differential equation (PDE) that can be solved by the Feynman-Kac lemma. The solution of optimality equation and joint optimal control action are then formulated as path integrals and approximated by Monte-Carlo (MC) method. Parallel random sampling technique is introduced to accelerate and parallelize the approximation of the PIC solutions, and state measurements and sampled trajectory data are exchanged between neighboring agents. Illustrative examples of a cooperative UAV team are presented to verify the effectiveness and advantages of cooperative PIC algorithm.

This paper is organized as follows: Section II formulates the control problem; Section III investigates the cooperative PIC algorithm; Section IV presents the simulation example, and Section V draws the conclusions. For a matrix and a vector , denotes the determinant of , and . For a set , denotes its cardinality.

## 2 Problem Formulation

The mathematical representation for MAS, the cooperative control framework, including the stochastic dynamics and optimal control formulation are introduced in this section.

### 2.1 Multi-Agent System and Factorial Subsystems

For a MAS with homogeneous agents indexed by , we use a connected graph to represent the bilateral communication network underlying this MAS, where vertex denotes agent and undirected edge , if agent and can directly exchange their information with each other. For agent , is the index set of all agents neighboring agent . The factorial subsystem comprises a central agent and its neighboring agents . Figure 1 shows an example of a MAS and two of its factorial subsystems.

Our cooperative control framework is a trade-off scheme between the cooperation among agents and computational complexity. Instead of optimizing a fully factorized cost function based on the mean-field assumption [17] or a global cost function relying on the knowledge of global states [16], we design the local control action , which depends on the local observation of agent , by minimizing the joint cost function of subsystem . Under this cooperative control scheme, we not only capture and optimize the correlation between neighboring agents, but circumvent the dependency on the global state as well as the exponential growth of the global state dimension.

### 2.2 Stochastic Optimal Control

Consider a network of agents with homogeneous and mutually independent passive dynamics. We use the following Itô diffusion process to describe the joint dynamics of subsystem

 d¯xi=¯fi(¯xi,t)dt+¯Bi(¯xi)[¯ui(¯xi,t)dt+¯σid¯wi], (1)

where is the joint states of subsystem with denoting the state dimension of a single agent; , , and are the joint passive dynamics, joint control matrix, and joint control input of subsystem with being the input dimension of a single agent; joint noise is a vector of Brownian components with zero mean and unit rate of variance, and is a positive semi-definite matrix that denotes the joint covariance of noise . By rearranging the components in , the stochastic dynamics (1) can be partitioned into a non-directly actuated part with states and a directly actuated part with states , where and respectively denote the dimensions of non-directly actuated states and directly actuated states for a single agent. Consequently, the joint passive dynamics and control matrix can respectively be partitioned as and .

Let denote the set of joint interior states in subsystem . When , the joint running cost function of is defined as

 ci(¯xi,¯ui)=qi(¯xi)+12¯ui(¯xi,t)⊤¯Ri¯ui(¯xi,t), (2)

where is a state-related cost, and is a control-quadratic term with being positive definite. Let denote the set of joint exit states in subsystem . When , the terminal cost function is defined as , where indicates the exit time. A cost-to-go function for first-exit problem subject to control policy can then be defined as . We can minimize by solving for the joint optimal control action from the following joint optimality equation

 Vi(¯xi,t)=min¯uiE¯ui¯xti,t[ϕi(¯xtfi)+∫tftci(¯xi(τ),¯ui(τ)) dτ], (3)

where the value function is the expected cumulative running cost for starting at and acting optimally thereafter. By following the local optimal control action marginalized from , each (central) agent acts optimally to minimize , while the (global) optimality condition in (3) can only be attained when is fully connected, since the local optimal control action of (neighboring) agent usually does not accord with . This conflict widely exists in distributed optimal control and optimization problems when the networks are subject to local/partial observation and limited communication, and some serious and heuristic studies on the global- and sub-optimality of distributed systems can be found in [19, 6, 20, 21] and references therein. We will not dive into this technical detail, as the objective of this paper is to propose a sub-optimal PIC scheme with sufficient computation and sample efficiency in networked MAS.

## 3 Cooperative Path Integral Control

The joint optimality equation (3) is first cast as a linear PDE that can be solved by the Feynman-Kac formula. The solution and the joint optimal control action are then formulated as path integrals that can be approximated by distributed stochastic sampling.

### 3.1 Linear Formulation

We first formulate the joint optimality equation (3) as a joint HJB equation and cast it as a linear PDE with the following exponential transformation, which is also known as the Cole-Hopf transformation [10]:

 Z(¯xi,t)=exp[−Vi(¯xi,t)/λi], (4)

where is a scalar, and is the desirability function of joint state at time . The following theorem states the joint HJB equation, the joint optimal control action , and a linear formulation for (3) along with a closed-form solution.

###### Theorem 1.

For the factorial subsystem subject to joint dynamics (1) and running cost function (2), the joint optimality equation (3) is equivalent to the following joint stochastic HJB equation

 −∂tVi(¯xi,t)=min¯uiE¯ui¯xi,t[12¯ui(¯xi,t)⊤¯Ri¯ui(¯xi,t)+qi(¯xi,t)+∑j∈¯Ni[fj(xj,t) (5) +Bj(xj)uj(¯xi,t)]⊤∇xjVi(¯xi,t)+12∑j∈¯Nitr(Bj(xj)σjσ⊤jBj(xj)⊤∇xjxjVi(¯xi,t))\allowdisplaybreaks]

with boundary condition . The minimum of (5) is attained by the joint optimal control action

 ¯u∗i(¯xi,t)=−¯R−1i¯Bi(¯xi)⊤∇¯xiVi(¯xi,t). (6)

With transformation (4), control action (6) and condition , the joint stochastic HJB equation (5) can be formulated as

 (7)

with boundary condition and has a closed-form solution

 Zi(¯xi,t)=E¯xi,t⎡⎣exp⎛⎝−ϕi(¯ytfi)λi−∫tftqi(¯yi,τ)λi dτ⎞⎠⎤⎦, (8)

where the diffusion process is subject to initiated at .

###### Proof.

See Appendix I for the proof. ∎

The condition implies that high control cost is assigned to a control channel with low variance noise, while a control channel with high variance noise has cheap control cost. With some auxiliary techniques this condition can be relaxed [22].

### 3.2 Path Integral Approximation

While a closed-form solution for is given in Theorem 1, the expectation over all uncontrolled trajectories initiated at is intractable to compute. A conventional approach in statistical physics for evaluating this expectation is to rewrite it as a path integral and approximate the integral with sampling methods. The following proposition gives the path integral formulae for the desirability function and the joint optimal control action .

###### Proposition 2.

Divide the time span from to into intervals of even length , , and let denote the joint trajectory segment on time interval subject to joint dynamics (1), zero control , and initial condition . The desirability function (8) can then be reformulated as a path integral

where the path variable represents all uncontrolled trajectories of subsystem starting at , and the generalized path value

 ~Sε,λii(¯x(0)i,¯ℓi,t0)=ϕi(¯x(K)i)λi+ελiK−1∑k=0qi(¯x(k)i,tk)+ε2λiK−1∑k=0∥∥α(k)i∥∥2(H(k)i)−1+12K−1∑k=0log∣∣H(k)i∣∣ (9)

with and . Hence, the joint optimal control action for subsystem can then be reformulated as a path integral

 ¯u∗i(¯xi,t)=¯σi¯σ⊤i¯Bi(d)(xi)⊤⋅limε↓0∫~p∗i(¯ℓi|¯x(0)i,t0)⋅~ui(¯x(0)i,¯ℓi,t0) d¯ℓi, (10)

where the optimal path distribution is

 ~p∗i(¯ℓi|¯x(0)i,t0)=exp(−~Sε,λii(¯x(0)i,¯ℓi,t0))∫exp(−~Sε,λii(¯x(0)i,¯ℓi,t0)) d¯ℓi, (11)

and the initial control vector is

 ~ui(¯x(0)i,¯ℓi,t0)=−ελi∇¯x(0)i(d)qi(¯x(0)i,t0)+(H(0)i)−1α(0)i. (12)
###### Proof.

See Appendix II for the proof. ∎

We then approximate the PIC (10) and optimal path distribution (11) with MC method. Given a batch of uncontrolled trajectories , we can estimate the optimal path distribution (11) with the following sampling estimator

 ~p∗i(¯ℓ[y]i|¯x(0)i,t0)≈exp(−~Sε,λii(¯x(0)i,¯ℓ[y]i,t0))∑Yy=1exp(−~Sε,λii(¯x(0)i,¯ℓ[y]i,t0)), (13)

where denotes the generalized path value of sampled trajectory . Hence, an estimator for joint optimal control action (10) can be

 ¯u∗i=¯σi¯σ⊤i¯Bi(d)(xi)⊤Y∑y=1~p∗i(¯ℓ[y]i|¯x(0)i,t0)~ui(¯x(0)i,¯ℓ[y]i,t0), (14)

where is the initial control vector of sampled trajectory . For a single agent, the sampling procedure can be expedited by the parallel computation of GPU units [16]. Meanwhile, instead of generating the joint trajectory set from a single agent, we can exploit the computation resource of MAS by letting each agent to sample its local trajectories and assembling the joint trajectory set via communication. With the estimation of joint optimal control action from (14), central agent acts by following the local control action extracted from . Algorithm 1 summaries the procedures of cooperative PIC in stochastic MASs.

## 4 Simulation Example

We demonstrate the cooperative PIC algorithm in Algorithm 1 with a team of cooperative UAVs. Cooperation among UAVs is essential for the tasks that cannot be accomplished by a single UAV and demand multiple UAVs, such as information communication, lifting and carrying heavy load, and patrol with synthetic sensors, which require cooperative UAVs to fulfill some certain constraints, e.g. flying closely or maintaining an identical orientation or speed. First, we consider a UAV team with three agents and subject to the communication network in Figure 2.

UAV 1 and 3, subject to their correlated running cost functions respectively, are tightly connected such that they can cooperate with each other while flying towards their destinations. By contrast, UAV 2, subject to an independent running cost function and only coupled with UAV 1 and 3 via their terminal costs as in [17], is loosely connected with other UAVs and will fly to its destination independently. Each UAV is described by the following UAV dynamics [17, 23]:

 ⎛⎜ ⎜ ⎜⎝dxidyidvidφi⎞⎟ ⎟ ⎟⎠=⎛⎜ ⎜ ⎜⎝vicosφivisinφi00⎞⎟ ⎟ ⎟⎠dt+⎛⎜ ⎜ ⎜⎝00001001⎞⎟ ⎟ ⎟⎠[(uiωi)dt+(σi00νi)dwi], (15)

where , , and denote the position, forward velocity, and heading angle of the -th UAV, respectively; forward acceleration and angular velocity are the control inputs, and disturbance is a standard Brownian motion. Control matrix is constant in (15), and we set noise level parameters and in simulation. In order to achieve the requirements that UAV 1 and 3 fly closely towards their destination, and UAV 2 independently flies to its destination, we design the following state-related cost functions

 q1(¯x1) =w11⋅(∥(x1,y1)−(xtf1,ytf1)∥2−dmax1)+w13⋅(∥(x1,y1)−(x3,y3)∥2−dmax13), q2(¯x2) =w22⋅(∥(x2,y2)−(xtf2,ytf2)∥2−dmax2), (16) q3(¯x3) =w33⋅(∥(x3,y3)−(xtf3,ytf3)∥2−dmax3)+w31⋅(∥(x3,y3)−(x1,y1)∥2−dmax31),

where is the weight that contributes to driving agent towards its exit state ; is the weight related to the distance between UAVs and , and and are the regularization terms for numerical stability, which are respectively assigned by the initial or maximal values of and .

In order to verify the improvements brought by cooperative or joint cost functions, we demonstrate an identical flight task while alternating the value of in (4). When , UAVs 1 and 3 cooperate with each other by flying closely together, while when , we restore the factorial running cost functions considered in [17], and UAVs are only correlated via their terminal cost functions. For three UAVs with initial states , and , we want them to arrive at an identical terminal state in sec. The period of each control cycle is sec. When generating the trajectory roll-outs , the time interval from to is divided into intervals of equal length , i.e. , until becomes less than sec. The noise level parameters are increased to and to improve the sampling and exploration efficiency. The size of in estimator (14) is sampled trajectories for each control cycle. Control matrices are chosen as identity matrices. The trajectories of UAVs and the relative distance between UAVs 1 and 3 subject to different cost functions are respectively presented in Figure 3 and  Figure 4. The significant reduction of distance between UAVs 1 and 3 in Figure 3, and Figure 4 corroborates that our cooperative PIC algorithm facilitates the interaction and cooperation among the neighboring agents of MASs. For other types of cooperation, such as maintaining an identical orientation, one can realize them by accordingly designing the state-related cost functions. To demonstrate that our cooperative PIC is able to address more complicated tasks, such as obstacle avoidance and multiple objectives, we introduce obstacles (shaded regions) and assign different boundary states to agents in Figure 5.

In the end, we test our cooperative PIC scheme on a larger UAV network with 9 agents as shown in Figure 6. For UAV , the initial state is at , and the state-related cost function is defined by

 qi(¯xi)=wii⋅(∥(xi,yi)−(xtfi,ytfi)∥2−dmaxi)+wi,i−1⋅(∥(xi,yi)−(xi−1,yi−1)∥−dmaxi,i−1)+wi,i+1⋅(∥(xi,yi)−(xi+1,yi+1)∥−dmaxi,i+1) (17)

where is the weight related to the distance between agent and ; when or ; is the regularization term for numerical stability, which is assigned by the initial distance between agent and in this example, and the rest of notations are the same as in (4). Agents 1 to 6, which share an identical terminal state at with sec, are tightly connected via their correlated running cost functions with , , and . The connection between agents 6 and 7 is loose with . Agents 7 and 8, which have an identical exit state at , are tightly correlated with and . Agent 9 is loosely connected with agent 8 with , , and exit state at . Other parameters are the same as in the first example. A simulation result is presented in Figure 7. For some network topology, e.g. loop as in Figure 2, complete binary tree or line as in Figure 6, in which the size of every factorial subsystem is tractable, increasing the total amount of agents in network will not significantly escalate the computation burden on each agent for this cooperative PIC algorithm.

## 5 Conclusion

A cooperative path integral control algorithm for stochastic MASs has been investigated in this paper. A distributed control framework that relies on the local observations of agents and hence circumvents the curse of dimensionality when the number of agents increases has been proposed, and a cooperative path integral algorithm has been designed to guide each agent to cooperate with its neighboring agents and behave optimally to minimize their joint cost function. Simulation examples have been presented to verify the algorithm.

## Acknowledgment

This work is supported in part by AFOSR under Grant FA9550-15-1-0518, NSF under Grant CCF-1717154, CMMI-1663460, CNS-1932288, CNS-1932529, ECCS-1739732, NRI ECCS-1830639, and ZJU-UIUC Institute Research Program. In this arXiv version, the authors would also like to thank the readers and staff on arXiv.org.

## Appendix I: Proof of Theorem 1

We first show that the joint optimality equation (3) is equivalent to the joint HJB equation (5) and derive the joint optimal control . Substituting (2) into (3) and letting be a time step between initial time and exit time , the joint equation (3) can be rewritten as

 Vi(¯xi,t)=min¯uiE¯ui¯xi,t[Vi(¯xi,s)+∫stqi(¯xi,τ)+12¯ui(¯xi,τ)⊤¯Ri¯ui(¯xi,τ) dτ]. (18)

Dividing both sides of (18) by and letting , we have

 0=min¯uiE¯ui¯xi,t[dVi(¯xi,t)dt+qi(¯xi,t)+12¯ui(¯xi,t)⊤¯Ri¯ui(¯xi,t)]. (19)

Applying Itô’s lemma [24] to , dividing the result by , and taking the expectation over all possible trajectories starting at and subject to , we arrive at

 E¯ui¯xi,t[dVi(¯xi,t)dt]=∂Vi(¯xi,t)∂t+∑j∈¯Ni[fj(xj,t)+Bj(xj)uj(¯xi,t)]⊤⋅∇xjVi(¯xi,t)+12∑j∈¯Nitr(Bj(xj)σjσ⊤jBj(xj)⊤⋅∇xjxjVi(¯xi,t)), (20)

where operators and respectively refer to the gradient and Hessian matrix. Substituting (20) into (19), the joint stochastic HJB equation (5) in Theorem 2 is obtained. Taking the derivative of (5) w.r.t. and setting the result to zero, we can obtain the joint optimal control action in (6).

We then formulate the joint HJB equation (5) as a linear PDE by the Cole-Hopf transformation (4). Subject to (4), the agent-wise terms in (5) can be rewritten as

 [Bj(xj)uj(¯xi,t)]⊤∇xjVi(¯xi,t)+12uj(¯xi,t)⊤Rjuj(¯xi,t) (21) =−λ2i2Zi(¯xi,t)2∇xjZi(¯xi,t)⊤⋅Bj(xj)R−1jBj(xj)⊤∇xjZi(¯xi,