A Reinforcement Learning Framework for Sequencing Multi-Robot Behaviors*

A Reinforcement Learning Framework for
Sequencing Multi-Robot Behaviors*

Pietro Pierpaoli, Thinh T. Doan, Justin Romberg, and Magnus Egerstedt *This work was supported by the Army Research Lab under Grant DCIST CRA W911NF-17-2-0181. The authors are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA. {pietro.pierpaoli,thinhdoan, magnus}@gatech.edu, jrom@ece.gatech.edu

Given a list of behaviors and associated parameterized controllers for solving different individual tasks, we study the problem of selecting an optimal sequence of coordinated behaviors in multi-robot systems for completing a given mission, which could not be handled by any single behavior. In addition, we are interested in the case where partial information of the underlying mission is unknown, therefore, the robots must cooperatively learn this information through their course of actions. Such problem can be formulated as an optimal decision problem in multi-robot systems, however, it is in general intractable due to modeling imperfections and the curse of dimensionality of the decision variables. To circumvent these issues, we first consider an alternate formulation of the original problem through introducing a sequence of behaviors’ switching times. Our main contribution is then to propose a novel reinforcement learning based method, that combines Q-learning and online gradient descent, for solving this reformulated problem. In particular, the optimal sequence of the robots’ behaviors is found by using Q-learning while the optimal parameters of the associated controllers are obtained through an online gradient descent method. Finally, to illustrate the effectiveness of our proposed method we implement it on a team of differential-drive robots for solving two different missions, namely, convoy protection and object manipulation.

I Introduction

In multi-robot systems, there has been a great success in designing distributed controllers to address individual tasks through coordination [oh2015survey, cortes2017coordinated, antonelli2013interconnected, schwager2011unifying]. However, many complex tasks in real-world applications require the robots being capable of performing beyond what has been achieved with task-oriented controllers. In addition, it is often that the information (or model) of the underlying tasks required by the existing controllers is partially known or imperfect, making solving these tasks nontrivial. For example, consider a team of robots tasked with moving a box between two points without previous knowledge of the box’s physical properties (e.g. mass distribution, friction with the ground). Robots should first localize the box by exploring the space. Then, depending on their capabilities and available behaviors, the box should be pushed or lifted towards its destination.

Through the paradigm of behavioral robotics [arkin1998behavior] it is possible to address this complexity by sequentially combining multi-robot task-oriented controllers (or behaviors); see for example [nagavalli2017automated, pierpaoli2019sequential]. In addition, to achieve a long-term autonomy and an interaction with imperfect model the robots have to choose actions based on the real-time feedback observed from the systems. The idea of learning by interacting with the system (or environment) is the main theme in reinforcement learning [bertsekas2019reinforcement, Sutton2018_book].

Our focus in this paper is to study the problem of selecting a sequence of coordinated behaviors and the associated parameterized controllers in multi-robot systems for achieving a given task, which could not be handled by any individual behavior. In addition, we consider the applications where a partial information of the underlying task is unknown, therefore, the robots must cooperatively learn this information through their course of actions. Such problem can be formulated as an optimal decision problem in multi-robot systems. However, it is in general intractable due to the imperfection of the problem model and the curse of dimensionality of the decision variables. To address these issues, we first formulate a behaviors switching condition based on the dynamic of the behavior itself. Our main contribution is then to propose a novel reinforcement learning based method, a combination of Q-learning and online gradient descent to find the optimal sequence of the robots’ behaviors and the optimal parameters of the associated controllers, respectively. Finally, to illustrate the effectiveness of our proposed method we implement it on a team of differential-drive robots for solving two different tasks, namely, convoy protection and simplified object manipulation.

Our paper is organized as follows. We start by describing a class of weighted-consensus coordinated behaviors in Section II. The behavior selection problem is then formulated as in Section III, while our proposed method for solving this problem is presented in Section IV. We conclude this paper by illustrating two applications of the proposed technique to a multi-robot systems in Section V.

I-a Related Work

A common approach in the design of multi-robot controllers for solving specific tasks is through the use of local control rules, whose collective implementation results in the desired behavior. Weighted consensus protocols can be employed to this end. See for example [cortes2017coordinated] and references therein. The advantage of stating the multi-agent control problem in terms of task-specific controllers resides in the fact that provable guarantees exist on their convergence [zelazo2018graph].

The fundamental idea behind behavioral robotics [arkin1998behavior] is to compose task-specific primitives into sequential or hierarchical controllers. This idea has been largely investigated for single robot systems and less so for multi-robot teams, where constrains on the flow of information prevents direct application of the single robots’ algorithms. Proposed techniques for the composition of primitive controllers include formal methods [kress2018synthesis], path planning [nagavalli2017automated], Finite State Machines [marino2009behavioral], Petri Nets [klavins2000formalism], and Behavior Trees [colledanchise2014behavior]. Once an appropriate sequence of controllers is chosen, transitions between individual controllers must be feasible. Solutions to this problem include Motion Description Language [martin2012hybrid], Graph Process Specifications [twu2010graph] and Control Barrier Functions [li2018formally, pierpaoli2019sequential].

Finally, reinforcement learning offers a general paradigm for learning optimal policies in stochastic control problems based on simulation [bertsekas2019reinforcement, Sutton2018_book]. In this context, an agent seeks to find an optimal policy through interacting with the unknown environment with the goal of optimizing its long-term future reward. Motivated by broad applications of the multi-agent systems, for example, mobile sensor networks [CortesMKB2004, Ogren2004_TAC] and power networks [Kar2013_QDLearning], there is a growing interest in studying multi-agent reinforcement learning; see for example [zhang2018networked, Wai2018_NIPS, DoanMR2019_DTD(0)] and the references therein. Our goal in this paper is to consider another interesting application of multi-agent reinforcement learning in solving the optimal behaviors selection problem over multi-robot systems.

Ii Coordinated Control of Multi-Robot Systems

In this section, we provide some preliminaries for the main problem formulation. We start with some general definitions in multi-robot systems and then formally define the notion of behaviors used throughout the paper.

Ii-a Multi-Robot Systems

Consider a team of robots operating in a -dimensional domain, where we denote by the state of robot , for . In addition, the dynamic of the robots is governed by a single integrator given as


where is the controller at robot , which is a function of and the states of the robots interacting with robot . The pattern of interactions between the robots is presented by an undirected graph , where and are the index set and the set of pairwise interactions between the robots, respectively. Moreover, let be the neighboring set of robot .

Each controller at robot is composed of two components, one only depends on its own state while the other represents the interaction with its neighbors. In particular, the controller in (1) is given as


where , often referred to as an edge weight function [mesbahi2010graph], depends on the states of robot and its neighbors, and the parameter . In addition, is the state-feedback term at the robot , which depends only on its own state and a parameter representing robot ’s preference. Here, and are the feasible sets of the parameters and , respectively. A concrete example of such controller together with the associated parameters will be given in the next section. Finally, as studied in [cortes2017coordinated], one can define an appropriate energy function with respect to the graph , where the controller in (2) can be described as the negativity of the local gradient of , i.e.,


This observation will be useful for our later development.

Ii-B Coordinated Behaviors in Multi-Robot Systems

Given a collection of behaviors, our goal is to optimally select a sequence of them in order to complete a given mission.

Definition II.1

A coordinated behavior is defined by tuple


where and are feasible sets for the parameters of controller (2). Moreover, is the graph representing the interaction structure between the robots.

In addition, given distinct behaviors we compactly represent them as a library of behaviors


where each behavior is defined as in (4), i.e.,


Here, note that the feasible sets and , and the graph are different for different behaviors, that is, in switching between different behaviors the communication graphs of the robots may be time-varying. Moreover, based on Definition II.1 it is important to note the difference between behavior and controller. The controller (2) executed by the robots for a given behavior is obtained by selecting a proper pair of parameters from the sets and . Indeed, consider a behavior and let be the ensemble states of the robots at time . In addition, let , where , be the controllers of the robots defined in (2) for a feasible pair of parameters . The ensemble dynamic of the robots associated is then given as


To further illustrate the difference between a behavior and its associated controller, we consider the following example about a formation control problem.

Example II.2

Consider the formation control problem over a network of robots moving in a plane in Fig. 1, where the desired inter-robot distances are given by a vector , with . Here, agent acts as a leader and moves toward the goal (red dot). Note that the desire formation also implies the interaction structure between the robots (graph in our setting). The goal of the robots is to maintain their desired formation while moving to the red dot. To achieve this goal, one possible choice of the edge-weight function of the controller (2) is


while the state-feedback term except for the one at the leader given as


In this example, is simply a subset of while is a set of geometrically feasible distances. Thus, given the formation control behavior , the controllers of the robots can be easily derived from (2).

Fig. 1: Example formation for a team of robots and one leader with goal (left). Interaction graph needed for the correct formation assembling (right).

We conclude this section with some comments on the formation control problem described above, which are the motivation for our study in the next section. In Example II.2, one can choose a single behavior together with a pair of parameters for solving the formation control problem [mesbahi2010graph]. This controller, however, is designed under the assumption that the environment is static and known, i.e., the target in Example II.2 is fixed and known by the robots. Such an assumption is less practical since in many applications, the robots are often operating in dynamically evolving and potentially unknown environments; for example, is time-varying and unknown. On the other hand, while the formation control problem can be solved by using a single behavior, many practical complex tasks require the robots to implement more than one behavior [nagavalli2017automated, pierpaoli2019sequential]. Our interest, therefore, is to consider the problem of selecting a sequence of the behaviors in for solving a given task while assuming that the state of the environment is unknown and possibly time-varying. In our setting, although the dynamic of the environment is unknown, we will assume that the robots can observe the state of the environment at any time through their course of action. We will refer to this setting as a behavior selection problem, which is formally presented in the next section. Finally, depending on the application, the environment can represent different quantities, (e.g. the target (red dot) in Example II.2).

Iii Optimal Behavior Selection Problems

In this section, we present the problem of optimal behaviors selection over a network of robots, motivated by the reinforcement learning literature. In particular, consider a team of robots cooperatively operating in an unknown environment and their goal is to complete a given mission in a time interval . Let and be the states of robots and environment at time , respectively. At any time , the robots first observe the state of the environment , select a behavior chosen from the library , compute the pair of parameters associated with , and implement the resulting controller . The environment then moves to a new state and the robots get a reward returned by the environment based on the selected behavior and tuning parameters. We assume that the rewards encode the given mission, which is motivated by the usual consideration in the literature of reinforcement learning [Sutton2018_book]. That is, solving the task is equivalent to maximizing the total accumulated rewards received by the robots. In Section V, we provide a few examples of how to design reward functions for particular applications. It is worth to point out that designing such a reward function is itself challenging and requires a good knowledge on the underlying problem [Sutton2018_book].

One can try to solve the optimal behavior selection problem by using the existing techniques in reinforcement learning. However, this problem is in general intractable since the dimension of state space is infinite, i.e., and are continuous variables. Moreover, due to the physical constraint of the robots, it is infeasible for the robots to switch to a new behavior at every time instant. That is, the robots require a finite amount of time to implement the controller of the selected behavior. Thus, to circumvent these issues we next consider an alternate version of this problem.

Inspired by the work in [mehta2006optimal] we introduce an interrupt condition , where is given in (3), defined as


and otherwise. Here is a small positive threshold. In other words, represents a binary trigger with value whenever the network energy for a certain behavior at time is smaller than a threshold, that is, the current controller is nearly complete. Thus, it is reasonable to enforce that the robots should not switch to a new behavior at time unless for a given . Based on this observation, given a desired threshold , let be the switching time associated with the current behavior defined as


Consequently, the mission time interval is partitioned into switching times satisfying


where each switching time is define as in (11). Note that the number of switching time depends on the accuracy and it is not known in advance. In this paper, we do not consider the problem of optimizing the number of switching times given a threshold .

At each switching time , the robots choose one behavior based on their current states and the environment state . Next, they find a pair of parameters and implement the underlying controller for . Based on the selected behaviors and parameters, the robots receive an instantaneous reward returned by the environment as an estimate for their selection. Finally, we assume that the states of the environment at the switching times belong to a finite set , i.e., for all .

Let be the accumulative reward received by the robots at the switching times in


As mentioned above, the optimal behavior selection is equivalent to the problem of seeking a sequence of behaviors from at and the associated parameters so that the accumulative reward is maximized. This optimization problem can be formulated as follows


where is the unknown dynamic of the environment. Since is unknown, one cannot use dynamic programming to solve this problem. Thus, in the next section we propose a novel method for solving (14), which is a combination of -learning and online gradient descent. Moreover, by introducing the switching times computing the optimal sequence of behaviors using Q-learning is now tractable.

Iv Q-Learning Approach For Behavior Selection

In this section we propose a novel reinforcement learning based method for solving problem (14). Our method is composed of -learning and online gradient descent methods to find an optimal sequence of behaviors and the associated parameters , respectively. In particular, we maintain a Q-table, whose entry is the state-behavior value estimating the performance of behavior given the environment state . Thus, one can view the -table as a matrix , where is the size of and is the number of behaviors in . The entries of Q-table are updated by using Q-learning while the controller parameters are updated by using the continuous-time online gradient method. These updates are formally presented in Algorithm 1.

In our algorithm, at each switching time the robots first observes the environment state , and then select a behavior with respect to the maximum entry in the -th row of the -table with tie broken arbitrarily. Next, the robots implement the distributed controller and use online gradient descent to find the best parameters associated with . Here the function represents the cost of implementing the controller at time , which can be chosen priorily by the robots (e.g., in (3)) or randomly returned by the environment. Based on the selected behavior and the associated controller, the robots receive an instantaneous reward while the environment moves to a new state . Finally, the robots updates the entry of the -table using the update of -learning method. It is worth to note that the Q-learning step is done in a centralized manner (either by the robots or a supervisory coordinator) since it depends on the state of the environment. Similarly, depending on the structure of the cost functions the online gradient descent updates can be implemented either distributedly or in a centralized manner. Finally, we note that the decentralized nature proper of each controller (2) is preserved.

for  do
end for
for  to  do
       Select ;
       while  do
       end while
end for
Algorithm 1 Q-learning algorithm for optimal behavior selection and tuning. The notation is used to represent variables uniformly selected from a set

V Applications

In this section we describe two implementations of our behaviors selection technique. For both examples, we considered a team of robots and a library of behaviors given as111These functions represent possible behaviors of the robots. In particular, for the consistency with Example II.2, all individual agents terms are described by proportional controllers with unitary gains. Unlike Example II.2, here we let be a scaling factor, which are assumed to be fixed by the formation.

  • [leftmargin = 5.1mm]

  • Static formation:

    where is the desired separation between robots and , while is a shape scaling factor.

  • Formation with leader:

    where and are defined as in the previous controller, while is the leader’s goal. Subscript leaders’ controllers.

  • Cyclic pursuit:

    where , is the radius of the cycle formed by the robots, and . The point is the center of the cycle.

  • Leader-follower:

    where is the separation between the agents and is the leader’s goal.

  • Triangulation coverage:


    where is the separation between the agents in the triangulation. For all the behaviors considered above, we assume the following parameter spaces and .

In both examples, we construct the state-action value function by implementing our propose method, Algorithm 1, on the Robotarium simulator  [pickem2017robotarium], which captures a number of the features of the actual robots, such as the full kinematic model of the vehicles and collision avoidance algorithm.

V-a Convoy Protection

First, we consider a convoy protection problem, where a team of robots must surround a moving target and maintain a single robot-to-target distance equal to a constant at all times. Although this problem can be solved by executing a single behavior (e.g. cyclic-pursuit), it allows us to compare the performance of our framework against an ideal solution. The position of the target is denoted with and it’s described by the following dynamics


where is a constant velocity and is a zero-mean Gaussian disturbance. In this case, the state of the environment at time step is considered to be the separation between robots’ centroid and the target


where denotes the Euclidian norm. The reward provided by the environment at time is


where the first term represents the proximity between centroid and target, while the second term weights the individual robot-to-target distance. Training is executed over episodes, with an exponentially decaying -greedy policy.

The plot in Fig. 2 shows the collected rewards over a trial of episodes. The results from the trained model (blue) are compared against an ad-hoc ideal solution (red), where cyclic-pursuit behavior is recursively executed with parameters and being selected so that the resulting cycle has radius and is centered on the target’s position. Finally, we show the rewards collected when the behaviors and parameters are selected uniformly at random (green). The variance of the results from random selection could be a direct indicator for the complexity of the problem.

Fig. 2: Comparison between accrued reward for different episodes of the convoy protection example. Rewards from trained model (blue) are compared with ad-hoc solution (red) and random behaviors/parameters selection (green).

V-B Simplified Object Manipulation

In the second example in Fig. 3, we consider a team of robots tasked with moving an object from two points. Let represent the position of the object at time . In order not to complicate the focus of the experiment, we assume a simplified manipulation dynamics. In particular, the box maintains its position if the closest robot is further than a certain threshold (i.e., object is not detected by the robots), otherwise it moves as . This manipulation dynamic guarantees that the task cannot be solved with fixed behavior and parameters since it is unknown to the robots. In this context, the robots get the following reward


where is a constant used to weight the running time until completion of an episode and is the distance between the box and its final destination .

Fig. 3: Screen shoot from Robotarium simulator during execution of the simplified object manipulation scenario at three different times. Executed behaviors are leader-follower (left), Formation with leader(center), and Cyclic-pursuit(right). Blue square represents the object which is collectively transported towards the origin (blue circle).
Fig. 4: Comparison between accrued reward over episodes of the object manipulation example. Rewards from trained model (blue) are compared with random behaviors/parameters selection (red).

Vi Conclusion

In this paper, we presented a reinforcement learning based approach for solving the optimal behavior selection problem, where the robots interact with an unknown environment. Given a finite library of behaviors, our technique exploits rewards collected through interaction with the environment to solve a given task that could not be solved by any single behavior. We also provide some numerical experiments on a network of robots to illustrate the effectiveness of our method. Future directions of this work include optimal design of behavior switching times and decentralized implementation of the Q-learning update.


Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description