A Reinforcement Learning Framework for
Sequencing MultiRobot Behaviors*
Abstract
Given a list of behaviors and associated parameterized controllers for solving different individual tasks, we study the problem of selecting an optimal sequence of coordinated behaviors in multirobot systems for completing a given mission, which could not be handled by any single behavior. In addition, we are interested in the case where partial information of the underlying mission is unknown, therefore, the robots must cooperatively learn this information through their course of actions. Such problem can be formulated as an optimal decision problem in multirobot systems, however, it is in general intractable due to modeling imperfections and the curse of dimensionality of the decision variables. To circumvent these issues, we first consider an alternate formulation of the original problem through introducing a sequence of behaviors’ switching times. Our main contribution is then to propose a novel reinforcement learning based method, that combines Qlearning and online gradient descent, for solving this reformulated problem. In particular, the optimal sequence of the robots’ behaviors is found by using Qlearning while the optimal parameters of the associated controllers are obtained through an online gradient descent method. Finally, to illustrate the effectiveness of our proposed method we implement it on a team of differentialdrive robots for solving two different missions, namely, convoy protection and object manipulation.
I Introduction
In multirobot systems, there has been a great success in designing distributed controllers to address individual tasks through coordination [oh2015survey, cortes2017coordinated, antonelli2013interconnected, schwager2011unifying]. However, many complex tasks in realworld applications require the robots being capable of performing beyond what has been achieved with taskoriented controllers. In addition, it is often that the information (or model) of the underlying tasks required by the existing controllers is partially known or imperfect, making solving these tasks nontrivial. For example, consider a team of robots tasked with moving a box between two points without previous knowledge of the box’s physical properties (e.g. mass distribution, friction with the ground). Robots should first localize the box by exploring the space. Then, depending on their capabilities and available behaviors, the box should be pushed or lifted towards its destination.
Through the paradigm of behavioral robotics [arkin1998behavior] it is possible to address this complexity by sequentially combining multirobot taskoriented controllers (or behaviors); see for example [nagavalli2017automated, pierpaoli2019sequential]. In addition, to achieve a longterm autonomy and an interaction with imperfect model the robots have to choose actions based on the realtime feedback observed from the systems. The idea of learning by interacting with the system (or environment) is the main theme in reinforcement learning [bertsekas2019reinforcement, Sutton2018_book].
Our focus in this paper is to study the problem of selecting a sequence of coordinated behaviors and the associated parameterized controllers in multirobot systems for achieving a given task, which could not be handled by any individual behavior. In addition, we consider the applications where a partial information of the underlying task is unknown, therefore, the robots must cooperatively learn this information through their course of actions. Such problem can be formulated as an optimal decision problem in multirobot systems. However, it is in general intractable due to the imperfection of the problem model and the curse of dimensionality of the decision variables. To address these issues, we first formulate a behaviors switching condition based on the dynamic of the behavior itself. Our main contribution is then to propose a novel reinforcement learning based method, a combination of Qlearning and online gradient descent to find the optimal sequence of the robots’ behaviors and the optimal parameters of the associated controllers, respectively. Finally, to illustrate the effectiveness of our proposed method we implement it on a team of differentialdrive robots for solving two different tasks, namely, convoy protection and simplified object manipulation.
Our paper is organized as follows. We start by describing a class of weightedconsensus coordinated behaviors in Section II. The behavior selection problem is then formulated as in Section III, while our proposed method for solving this problem is presented in Section IV. We conclude this paper by illustrating two applications of the proposed technique to a multirobot systems in Section V.
Ia Related Work
A common approach in the design of multirobot controllers for solving specific tasks is through the use of local control rules, whose collective implementation results in the desired behavior. Weighted consensus protocols can be employed to this end. See for example [cortes2017coordinated] and references therein. The advantage of stating the multiagent control problem in terms of taskspecific controllers resides in the fact that provable guarantees exist on their convergence [zelazo2018graph].
The fundamental idea behind behavioral robotics [arkin1998behavior] is to compose taskspecific primitives into sequential or hierarchical controllers. This idea has been largely investigated for single robot systems and less so for multirobot teams, where constrains on the flow of information prevents direct application of the single robots’ algorithms. Proposed techniques for the composition of primitive controllers include formal methods [kress2018synthesis], path planning [nagavalli2017automated], Finite State Machines [marino2009behavioral], Petri Nets [klavins2000formalism], and Behavior Trees [colledanchise2014behavior]. Once an appropriate sequence of controllers is chosen, transitions between individual controllers must be feasible. Solutions to this problem include Motion Description Language [martin2012hybrid], Graph Process Specifications [twu2010graph] and Control Barrier Functions [li2018formally, pierpaoli2019sequential].
Finally, reinforcement learning offers a general paradigm for learning optimal policies in stochastic control problems based on simulation [bertsekas2019reinforcement, Sutton2018_book]. In this context, an agent seeks to find an optimal policy through interacting with the unknown environment with the goal of optimizing its longterm future reward. Motivated by broad applications of the multiagent systems, for example, mobile sensor networks [CortesMKB2004, Ogren2004_TAC] and power networks [Kar2013_QDLearning], there is a growing interest in studying multiagent reinforcement learning; see for example [zhang2018networked, Wai2018_NIPS, DoanMR2019_DTD(0)] and the references therein. Our goal in this paper is to consider another interesting application of multiagent reinforcement learning in solving the optimal behaviors selection problem over multirobot systems.
Ii Coordinated Control of MultiRobot Systems
In this section, we provide some preliminaries for the main problem formulation. We start with some general definitions in multirobot systems and then formally define the notion of behaviors used throughout the paper.
Iia MultiRobot Systems
Consider a team of robots operating in a dimensional domain, where we denote by the state of robot , for . In addition, the dynamic of the robots is governed by a single integrator given as
(1) 
where is the controller at robot , which is a function of and the states of the robots interacting with robot . The pattern of interactions between the robots is presented by an undirected graph , where and are the index set and the set of pairwise interactions between the robots, respectively. Moreover, let be the neighboring set of robot .
Each controller at robot is composed of two components, one only depends on its own state while the other represents the interaction with its neighbors. In particular, the controller in (1) is given as
(2) 
where , often referred to as an edge weight function [mesbahi2010graph], depends on the states of robot and its neighbors, and the parameter . In addition, is the statefeedback term at the robot , which depends only on its own state and a parameter representing robot ’s preference. Here, and are the feasible sets of the parameters and , respectively. A concrete example of such controller together with the associated parameters will be given in the next section. Finally, as studied in [cortes2017coordinated], one can define an appropriate energy function with respect to the graph , where the controller in (2) can be described as the negativity of the local gradient of , i.e.,
(3) 
This observation will be useful for our later development.
IiB Coordinated Behaviors in MultiRobot Systems
Given a collection of behaviors, our goal is to optimally select a sequence of them in order to complete a given mission.
Definition II.1
A coordinated behavior is defined by tuple
(4) 
where and are feasible sets for the parameters of controller (2). Moreover, is the graph representing the interaction structure between the robots.
In addition, given distinct behaviors we compactly represent them as a library of behaviors
(5) 
where each behavior is defined as in (4), i.e.,
(6) 
Here, note that the feasible sets and , and the graph are different for different behaviors, that is, in switching between different behaviors the communication graphs of the robots may be timevarying. Moreover, based on Definition II.1 it is important to note the difference between behavior and controller. The controller (2) executed by the robots for a given behavior is obtained by selecting a proper pair of parameters from the sets and . Indeed, consider a behavior and let be the ensemble states of the robots at time . In addition, let , where , be the controllers of the robots defined in (2) for a feasible pair of parameters . The ensemble dynamic of the robots associated is then given as
(7) 
To further illustrate the difference between a behavior and its associated controller, we consider the following example about a formation control problem.
Example II.2
Consider the formation control problem over a network of robots moving in a plane in Fig. 1, where the desired interrobot distances are given by a vector , with . Here, agent acts as a leader and moves toward the goal (red dot). Note that the desire formation also implies the interaction structure between the robots (graph in our setting). The goal of the robots is to maintain their desired formation while moving to the red dot. To achieve this goal, one possible choice of the edgeweight function of the controller (2) is
(8) 
while the statefeedback term except for the one at the leader given as
(9) 
In this example, is simply a subset of while is a set of geometrically feasible distances. Thus, given the formation control behavior , the controllers of the robots can be easily derived from (2).
We conclude this section with some comments on the formation control problem described above, which are the motivation for our study in the next section. In Example II.2, one can choose a single behavior together with a pair of parameters for solving the formation control problem [mesbahi2010graph]. This controller, however, is designed under the assumption that the environment is static and known, i.e., the target in Example II.2 is fixed and known by the robots. Such an assumption is less practical since in many applications, the robots are often operating in dynamically evolving and potentially unknown environments; for example, is timevarying and unknown. On the other hand, while the formation control problem can be solved by using a single behavior, many practical complex tasks require the robots to implement more than one behavior [nagavalli2017automated, pierpaoli2019sequential]. Our interest, therefore, is to consider the problem of selecting a sequence of the behaviors in for solving a given task while assuming that the state of the environment is unknown and possibly timevarying. In our setting, although the dynamic of the environment is unknown, we will assume that the robots can observe the state of the environment at any time through their course of action. We will refer to this setting as a behavior selection problem, which is formally presented in the next section. Finally, depending on the application, the environment can represent different quantities, (e.g. the target (red dot) in Example II.2).
Iii Optimal Behavior Selection Problems
In this section, we present the problem of optimal behaviors selection over a network of robots, motivated by the reinforcement learning literature. In particular, consider a team of robots cooperatively operating in an unknown environment and their goal is to complete a given mission in a time interval . Let and be the states of robots and environment at time , respectively. At any time , the robots first observe the state of the environment , select a behavior chosen from the library , compute the pair of parameters associated with , and implement the resulting controller . The environment then moves to a new state and the robots get a reward returned by the environment based on the selected behavior and tuning parameters. We assume that the rewards encode the given mission, which is motivated by the usual consideration in the literature of reinforcement learning [Sutton2018_book]. That is, solving the task is equivalent to maximizing the total accumulated rewards received by the robots. In Section V, we provide a few examples of how to design reward functions for particular applications. It is worth to point out that designing such a reward function is itself challenging and requires a good knowledge on the underlying problem [Sutton2018_book].
One can try to solve the optimal behavior selection problem by using the existing techniques in reinforcement learning. However, this problem is in general intractable since the dimension of state space is infinite, i.e., and are continuous variables. Moreover, due to the physical constraint of the robots, it is infeasible for the robots to switch to a new behavior at every time instant. That is, the robots require a finite amount of time to implement the controller of the selected behavior. Thus, to circumvent these issues we next consider an alternate version of this problem.
Inspired by the work in [mehta2006optimal] we introduce an interrupt condition , where is given in (3), defined as
(10) 
and otherwise. Here is a small positive threshold. In other words, represents a binary trigger with value whenever the network energy for a certain behavior at time is smaller than a threshold, that is, the current controller is nearly complete. Thus, it is reasonable to enforce that the robots should not switch to a new behavior at time unless for a given . Based on this observation, given a desired threshold , let be the switching time associated with the current behavior defined as
(11) 
Consequently, the mission time interval is partitioned into switching times satisfying
(12) 
where each switching time is define as in (11). Note that the number of switching time depends on the accuracy and it is not known in advance. In this paper, we do not consider the problem of optimizing the number of switching times given a threshold .
At each switching time , the robots choose one behavior based on their current states and the environment state . Next, they find a pair of parameters and implement the underlying controller for . Based on the selected behaviors and parameters, the robots receive an instantaneous reward returned by the environment as an estimate for their selection. Finally, we assume that the states of the environment at the switching times belong to a finite set , i.e., for all .
Let be the accumulative reward received by the robots at the switching times in
(13) 
As mentioned above, the optimal behavior selection is equivalent to the problem of seeking a sequence of behaviors from at and the associated parameters so that the accumulative reward is maximized. This optimization problem can be formulated as follows
(14) 
where is the unknown dynamic of the environment. Since is unknown, one cannot use dynamic programming to solve this problem. Thus, in the next section we propose a novel method for solving (14), which is a combination of learning and online gradient descent. Moreover, by introducing the switching times computing the optimal sequence of behaviors using Qlearning is now tractable.
Iv QLearning Approach For Behavior Selection
In this section we propose a novel reinforcement learning based method for solving problem (14). Our method is composed of learning and online gradient descent methods to find an optimal sequence of behaviors and the associated parameters , respectively. In particular, we maintain a Qtable, whose entry is the statebehavior value estimating the performance of behavior given the environment state . Thus, one can view the table as a matrix , where is the size of and is the number of behaviors in . The entries of Qtable are updated by using Qlearning while the controller parameters are updated by using the continuoustime online gradient method. These updates are formally presented in Algorithm 1.
In our algorithm, at each switching time the robots first observes the environment state , and then select a behavior with respect to the maximum entry in the th row of the table with tie broken arbitrarily. Next, the robots implement the distributed controller and use online gradient descent to find the best parameters associated with . Here the function represents the cost of implementing the controller at time , which can be chosen priorily by the robots (e.g., in (3)) or randomly returned by the environment. Based on the selected behavior and the associated controller, the robots receive an instantaneous reward while the environment moves to a new state . Finally, the robots updates the entry of the table using the update of learning method. It is worth to note that the Qlearning step is done in a centralized manner (either by the robots or a supervisory coordinator) since it depends on the state of the environment. Similarly, depending on the structure of the cost functions the online gradient descent updates can be implemented either distributedly or in a centralized manner. Finally, we note that the decentralized nature proper of each controller (2) is preserved.
V Applications
In this section we describe two implementations of our behaviors selection technique. For both examples, we considered a team of robots and a library of behaviors given as^{1}^{1}1These functions represent possible behaviors of the robots. In particular, for the consistency with Example II.2, all individual agents terms are described by proportional controllers with unitary gains. Unlike Example II.2, here we let be a scaling factor, which are assumed to be fixed by the formation.

[leftmargin = 5.1mm]

Static formation:
where is the desired separation between robots and , while is a shape scaling factor.

Formation with leader:
where and are defined as in the previous controller, while is the leader’s goal. Subscript leaders’ controllers.

Cyclic pursuit:
where , is the radius of the cycle formed by the robots, and . The point is the center of the cycle.

Leaderfollower:
where is the separation between the agents and is the leader’s goal.

Triangulation coverage:
(15) where is the separation between the agents in the triangulation. For all the behaviors considered above, we assume the following parameter spaces and .
In both examples, we construct the stateaction value function by implementing our propose method, Algorithm 1, on the Robotarium simulator [pickem2017robotarium], which captures a number of the features of the actual robots, such as the full kinematic model of the vehicles and collision avoidance algorithm.
Va Convoy Protection
First, we consider a convoy protection problem, where a team of robots must surround a moving target and maintain a single robottotarget distance equal to a constant at all times. Although this problem can be solved by executing a single behavior (e.g. cyclicpursuit), it allows us to compare the performance of our framework against an ideal solution. The position of the target is denoted with and it’s described by the following dynamics
(16) 
where is a constant velocity and is a zeromean Gaussian disturbance. In this case, the state of the environment at time step is considered to be the separation between robots’ centroid and the target
(17) 
where denotes the Euclidian norm. The reward provided by the environment at time is
(18) 
where the first term represents the proximity between centroid and target, while the second term weights the individual robottotarget distance. Training is executed over episodes, with an exponentially decaying greedy policy.
The plot in Fig. 2 shows the collected rewards over a trial of episodes. The results from the trained model (blue) are compared against an adhoc ideal solution (red), where cyclicpursuit behavior is recursively executed with parameters and being selected so that the resulting cycle has radius and is centered on the target’s position. Finally, we show the rewards collected when the behaviors and parameters are selected uniformly at random (green). The variance of the results from random selection could be a direct indicator for the complexity of the problem.
VB Simplified Object Manipulation
In the second example in Fig. 3, we consider a team of robots tasked with moving an object from two points. Let represent the position of the object at time . In order not to complicate the focus of the experiment, we assume a simplified manipulation dynamics. In particular, the box maintains its position if the closest robot is further than a certain threshold (i.e., object is not detected by the robots), otherwise it moves as . This manipulation dynamic guarantees that the task cannot be solved with fixed behavior and parameters since it is unknown to the robots. In this context, the robots get the following reward
(19) 
where is a constant used to weight the running time until completion of an episode and is the distance between the box and its final destination .
Vi Conclusion
In this paper, we presented a reinforcement learning based approach for solving the optimal behavior selection problem, where the robots interact with an unknown environment. Given a finite library of behaviors, our technique exploits rewards collected through interaction with the environment to solve a given task that could not be solved by any single behavior. We also provide some numerical experiments on a network of robots to illustrate the effectiveness of our method. Future directions of this work include optimal design of behavior switching times and decentralized implementation of the Qlearning update.