LAC-Nav: Collision-Free Multiagent Navigation Based on the Local Action Cells
Collision avoidance is one of the primary problems in decentralized multiagent navigation: while the agents move towards their own targets, they must avoid colliding with one another. In this paper, we introduce the concept of the local action cell, which provides each agent with a set of velocities that are safe to perform. Consequently, as long as the local action cells are updated in time and each agent selects its motion within the corresponding cell, no collision should occur. Furthermore, we couple the local action cell with an adaptive learning framework, in which the performance of the selected motions is evaluated and used as a reference for the decisions in the following updates. The efficiency of the proposed approaches is demonstrated through experiments on three commonly considered scenarios, in comparison with several well-studied strategies.
Collision-free navigation is a fundamental problem in the design of multiagent systems, which are widely applied in fields such as robot control and traffic engineering. When moving agents in an environment with static or dynamic obstacles, the trajectories usually have to be planned so that no collision is caused. As the number of agents increases and the environment becomes large, planning the real-time motions of all agents in a centralized manner requires a huge amount of computation, and is often restricted by the efficiency of the communication between the agents and the planning monitor. Therefore, it is natural (and sometimes necessary) to consider decentralized navigation approaches, in which each individual agent is responsible for sensing the nearby obstacles and performing a proper motion to progress towards its destination without causing any collision. On the other hand, as a consequence of decentralized navigation, it is in general difficult for the agents to fully coordinate before making their independent moves. Thus the agents should also treat each other as obstacles to be avoided.
As noted in existing work, when avoiding collisions with the other agents, it is important to take into account that they are also intelligent and perform collision-avoidance motions themselves (otherwise, undesirable oscillations may be observed during the navigation). Consequently, no individual agent needs to take all the responsibility for making sure that the performed motion is safe. ORCA [Berg2011] is a well-known decentralized approach that guarantees optimal reciprocal collision-free velocities, except under certain conditions with densely packed agents. BVC [Zhou2017] restricts the agents to move inside non-intersecting areas, and thus guarantees collision avoidance. After the safe field of motions (i.e. the safe range of velocities or the safe area of positions) is determined, both the ORCA-based and the BVC-based approaches usually select the motion within the safe field that is closest to the preferred motion. Such a greedy strategy is natural and widely used in local-search-based optimization. However, it may lead to less efficient multiagent navigation, as the agents may refuse to detour until there is no way left to approach the target. In the worst case, with the greedy selection, agents may get stuck in a loop of two or more situations (also known as deadlocks). Although some tricks have been proposed to fix such drawbacks (including the ideas described in [Zhou2017]), they are not always valid in concrete implementations, and the improvements vary from case to case.
In this work, in order to improve the navigation efficiency, we extend the buffered Voronoi cell [Zhou2017] to the velocity space, and consider the effect of the relative velocities on potential conflicts. The traveling progress is also considered when selecting the motion to perform; consequently, the agents may detour earlier, whenever approaching the target directly leads to less progress in the moving distance.
In this work, we consider a set of disk-shaped agents moving in the plane. For any time point, agent of position is free to change its velocity , and after a short time , it moves to , provided that there are no collisions between the agents (i.e. the distance between any pair of agents is at least the sum of their radii). A decentralized navigation approach runs independently for each individual agent , and based on the observations of the environment, it updates the velocity in order to guide agent to the given and fixed destination/target . As the measure of the approach’s performance, we want all the agents to arrive at their destinations/targets as soon as possible, without causing any collisions.
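The motion model and the collision condition described above can be illustrated with a minimal sketch. The function names `step` and `in_collision` are hypothetical and introduced only for illustration; the sketch assumes a constant velocity during one update interval and treats two disks as colliding when their centre distance is below the sum of their radii.

```python
import math


def step(position, velocity, dt):
    """Advance a 2D position by one update interval at constant velocity."""
    return (position[0] + velocity[0] * dt, position[1] + velocity[1] * dt)


def in_collision(p_a, r_a, p_b, r_b):
    """Two disks collide when their centre distance is below the sum of radii."""
    return math.dist(p_a, p_b) < r_a + r_b


if __name__ == "__main__":
    p = step((0.0, 0.0), (1.0, 2.0), 0.5)
    print(p, in_collision(p, 1.0, (3.0, 1.0), 1.0))
```

Touching disks (distance exactly equal to the sum of the radii) are treated as collision-free here, which matches the "at least the sum of their radii" condition in the text.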
We introduced the concept of the local action cell to specify the underlying choices for the selection of the motion to perform, and proposed two approaches (LAC-Nav and LAC-Learn) that guarantee collision-free navigation. While the LAC-Nav approach simply performs the action of the largest penalized length (among all choices in the local action cell), the LAC-Learn approach evaluates the performed actions and adjusts the selection based on an adaptive learning framework. The experiment results show that the proposed approaches are more efficient in terms of the completion time (formally defined in the section of “Experiments”), compared to several well-studied approaches.
Velocity-based collision-free navigation has been extensively studied in the last two decades. The idea of reciprocal velocity obstacles (RVO, [Berg2008]) was introduced to reduce the problem of calculating a collision-free motion to solving a low-dimensional linear program, based on the definition of velocity obstacles [Fiorini1998]. It was further improved to derive the optimal reciprocal collision avoidance (ORCA, [Berg2011]) framework, which guarantees optimal reciprocal collision-free motions, except under certain conditions with densely packed agents. While the safety of the final motion is guaranteed by ORCA, the ALAN [Godoy2015] online learning framework has been proposed for adapting the preferred motions of multiple agents without the need for offline training, and CNav [Godoy2016] is designed to allow the agents to take the others’ preferred motions into account and adjust accordingly to achieve better coordination in crowded environments. Notice that although the efficiency of CNav has been demonstrated through experiments, it requires sharing some private information of the agents, such as their preferred motions or their targets, which is often a controversial issue in practical applications.
As the well-known Voronoi diagram can be used to divide the working space into non-intersecting areas, it has also been adopted for collision-free path planning with multiple robots [Garrido2006, Bhattacharya2008]. Inspired by the algorithms for the coverage control of the agents [Pimenta2008], and by a Voronoi-cell-based algorithm [Bandyopadhyay2014] introduced to avoid collisions within a larger probabilistic swarm, the buffered Voronoi cell (BVC, [Zhou2017]) approach has been proposed to guarantee collision avoidance in multiagent navigation, based only on the information of the positions. With up-to-date information on the others’ positions, the agents are restricted to move in non-intersecting areas, and thus there should be no collisions. In [Senbaslar2019], a trajectory planning algorithm was proposed to navigate the agents under higher-order dynamic limits, in which BVC is used as the low-level strategy to avoid collisions.
2 The Local Action Cells
In this paper, we assume that all the agents in have the same radius for the simplicity of the argument (for the case when the agents have different radii, the arguments in this paper can be directly extended by substituting the classical Voronoi diagram with its weighted variant). Thus for any time and any pair of non-colliding agents and , it always holds that , where stands for .
Recall that in [Zhou2017], the buffered Voronoi cell of agent is defined as
which implies a safe velocity domain
for agent to change and maintain its velocity in order to reach a point in , where is the length of the time interval between two consecutive updates. Equivalently, domain can be presented as
where is the unit vector along the same direction with , i.e. . Obviously, domain is the intersection of the half-planes ’s for each agent , with
Assuming that agent is moving at velocity and agent is moving at velocity , we estimate the colliding risk by calculating
and define the safe half-plane of agent according to agent as a subset of
where is the relax factor indicating how much the agent considers the long-sighted decision, and it is set to through this paper.
Now, we are ready to define the local action cell (LAC) of agent , denoted by , as a subset of velocities in the intersection of all the safe half-planes, i.e.
where indicates the maximum moving speed, is the destination/target of agent , denotes the angle (in radians) of the clockwise rotation of the argument vector to align with the positive direction of the -axis, and is a set of candidate angles which is defined by
through this paper. (See Figure 1 for an illustration of the local action cell of an agent moving through two neighbors.)
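As a geometric aid for the buffered Voronoi cell recalled at the beginning of this section, the following sketch tests whether a point lies in an agent's cell. It is only an illustration under stated assumptions: it uses one common way of writing the buffered half-plane (the perpendicular bisector between two agents, retracted by the agent radius towards the agent itself), and the names `in_buffered_half_plane` and `in_buffered_cell` are hypothetical. It does not reproduce the safe half-planes or the local action cell defined above, whose exact formulas are not restated here.

```python
import math


def in_buffered_half_plane(p, p_i, p_j, r):
    """True if p is on agent i's side of the bisector of p_i and p_j,
    at least r away from that bisector (assumes p_i != p_j)."""
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    dist = math.hypot(dx, dy)
    mx, my = (p_i[0] + p_j[0]) / 2.0, (p_i[1] + p_j[1]) / 2.0
    # Signed distance of p from the bisector, positive towards agent j.
    offset = ((p[0] - mx) * dx + (p[1] - my) * dy) / dist
    return offset + r <= 0.0


def in_buffered_cell(p, p_i, others, r):
    """True if p satisfies the buffered half-plane test for every other agent."""
    return all(in_buffered_half_plane(p, p_i, p_j, r) for p_j in others)


if __name__ == "__main__":
    print(in_buffered_cell((1.0, 0.0), (0.0, 0.0), [(4.0, 0.0)], 1.0))
```

With two agents at distance 4 and radius 1, the bisector is at distance 2 from each agent, so each buffered cell ends 1 unit before the bisector; two disk centres kept in their own cells therefore stay at least 2 units apart.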
3 Collision-Free Navigation
In this section, we introduce a distributed approach, named LAC-Nav, for collision-free navigation with multiple agents. As shown in Algorithm 1, the approach is straightforward, with the following steps executed in a loop: for each agent , calculate the current local action cell, and then select a proper velocity from the cell.
Algorithm 2 follows the definition of the local action cell and describes the calculation details; Algorithm 3 shows how the new velocity is selected. Given the current local action cell , each velocity is first evaluated according to the penalized length , where is the factor that is initialized as and decreased exponentially on the angle between and the direction of . Finally, the velocity of the maximum penalized length is returned as the result.
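The selection step can be sketched as follows. This is not Algorithm 3 itself: the exact penalty schedule is not restated here, so the sketch assumes that the penalty is `gamma` raised to the number of angular steps between the velocity and the goal direction. The names `penalized_length`, `select_velocity`, `gamma` and `angle_step` are hypothetical.

```python
import math


def penalized_length(velocity, goal_angle, gamma, angle_step):
    """Speed of `velocity`, discounted exponentially in its angular
    deviation from the goal direction (assumed penalty form)."""
    speed = math.hypot(velocity[0], velocity[1])
    if speed == 0.0:
        return 0.0
    angle = math.atan2(velocity[1], velocity[0])
    # Smallest absolute angular difference, in [0, pi].
    diff = abs((angle - goal_angle + math.pi) % (2.0 * math.pi) - math.pi)
    return (gamma ** (diff / angle_step)) * speed


def select_velocity(cell, goal_angle, gamma, angle_step):
    """Return the velocity of maximum penalized length (cell must be non-empty)."""
    return max(cell, key=lambda v: penalized_length(v, goal_angle, gamma, angle_step))


if __name__ == "__main__":
    # The goal-directed velocity is heavily constrained, so the detour wins.
    cell = [(0.2, 0.0), (0.0, 1.0)]
    print(select_velocity(cell, 0.0, 0.5, math.pi / 4))
```

In the example, the goal-directed candidate has length 0.2, while the orthogonal candidate has length 1.0 discounted to about 0.25, so the agent detours early instead of waiting until the direct motion is fully blocked.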
While calculating the local action cells, it is not necessary to consider all the agents in the environment. When the distance between agent and agent is at least , it holds directly that and . Thus the corresponding safe half-planes can be ignored in the calculation of the agents’ local action cells, which implies it is sufficient to consider only the neighbors within distance .
When considering only the agents within a distance , the number of an agent’s neighbors is at most , since there is no overlap between the neighbors and for each of them, at least of the body is covered by the disk of radius . Consequently, the loop of Lines is executed a constant number of times within one update step of an individual agent. Thus, the processing complexity of LAC is determined by the efficiency of detecting the neighbors in the specified range. In the simulations, the neighbors can be efficiently obtained by querying a KD-Tree that maintains all the positions. In more practical cases, the neighbor detection is often executed in a parallel process, and it can be assumed that the required information is always ready when it is needed.
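The neighbor restriction can be illustrated with a simple radius filter. The sketch below is a linear scan written with the standard library only, used as a stand-in for the KD-Tree query mentioned above; the function name `neighbors_within` and the parameter `cutoff` (the distance beyond which agents can be ignored) are hypothetical.

```python
import math


def neighbors_within(positions, i, cutoff):
    """Indices of agents other than i whose centres lie strictly within
    `cutoff` of agent i (agents at distance >= cutoff are ignored)."""
    p_i = positions[i]
    return [
        j
        for j, p_j in enumerate(positions)
        if j != i and math.dist(p_i, p_j) < cutoff
    ]


if __name__ == "__main__":
    positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    print(neighbors_within(positions, 0, 2.0))
```

The linear scan costs time proportional to the number of agents per query; a spatial index such as a KD-Tree gives the same result with a cheaper query when the number of agents is large.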
Learning with LAC.
In LAC-Nav, the new velocity is selected according to the penalized length, which can be roughly seen as an estimate of the traveling distance of the next move. On the other hand, it is also common to evaluate the performed actions and record the results, which also provides the information that may be useful for making decisions in the future. In the case when a specific behavior should perform well for a period of time, selecting the action of the best known evaluation should be more promising than trying based on the estimates only. Generally, the evaluations are learned as the agent keeps running in the “sense-evaluate-act” cycles.
Following the ALAN learning framework [Godoy2015], we propose the LAC-Learn approach, in which the reward of the latest performed action is defined as the sum of the penalized lengths of the velocities in the resulting local action cell. Notice that by this definition, the reward naturally incorporates both the goal-oriented performance and the politeness performance, which are treated as two separate components in ALAN. In fact, the lengths of the velocities approaching the destination reflect how efficient the performed action is in getting the agent closer to the goal, and the lengths of the velocities in the local action cell as a whole reflect the efficiency in avoiding crowded situations. Apart from the definition of the action reward, LAC-Learn also selects the new velocity in a different way from the one used in ALAN. With LAC-Learn, the selected new velocity is the one corresponding to the action that maximizes a linear combination of the reward and the penalized velocity length.
Inside an execution cycle of some agent , after the local action cell is calculated by LAC (Algorithm 2), the penalized length of each velocity in is calculated as in Line of SelectVel (Algorithm 3), and saved in a set . In UpdateReward, the reward of the last performed action is updated to the sum of all weights in , as mentioned earlier.
Notice that although the velocities given by may vary from step to step, in the local view, they can always be interpreted as the actions corresponding to the angles specified in . For example, without considering the variation of the length, the velocity pointing to the destination can always be interpreted as the action corresponding to angle . For an action/angle , we use to denote the velocity such that .
Following the ALAN learning framework, we calculate (by UpdateWUCB) and maintain (in ) the upper confidence bound within a moving time window (i.e. a sequence of consecutive time steps), which is used when the agent explores in the action space. As defined in [Godoy2015], the wUCB score of action during the last steps is defined by
where is the average reward of action , denotes the number of times action has been chosen, and denotes the total number of performed actions, all with respect to the moving time window.
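The wUCB score can be sketched as follows, assuming the usual upper-confidence-bound form: the average reward of an action plus the exploration term sqrt(2 ln(n) / n_a), with both counts taken over the moving time window. The function name `wucb_scores` and the convention of giving untried actions an infinite score are assumptions made for illustration.

```python
import math
from collections import deque


def wucb_scores(window, actions):
    """UCB-style scores over a moving window of (action, reward) pairs."""
    total = len(window)
    scores = {}
    for a in actions:
        rewards = [r for act, r in window if act == a]
        if not rewards:
            # Assumption: actions not tried inside the window are explored first.
            scores[a] = math.inf
        else:
            mean = sum(rewards) / len(rewards)
            scores[a] = mean + math.sqrt(2.0 * math.log(total) / len(rewards))
    return scores


if __name__ == "__main__":
    # A deque with maxlen drops the oldest entry automatically.
    window = deque([(0, 1.0), (0, 3.0), (1, 2.0)], maxlen=3)
    print(wucb_scores(window, [0, 1, 2]))
```

Using a bounded `deque` keeps only the most recent steps, so old evaluations stop influencing the score once they leave the window.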
Similar to the context-aware action selection approach proposed in [Godoy2015], SelectAct (Algorithm 5) decides with the “win-stay, lose-shift” strategy and the adaptive -greedy strategy in which the wUCB suggested action is chosen for the exploration.
When the agent is in the winning state (i.e. the goal-oriented action was performed in the last update and is still a good choice in the sense that the corresponding velocity is little constrained), it is natural to keep heading towards the goal. Otherwise, if the agent is in the losing state, it follows the -greedy strategy and exploits the action that maximizes a linear combination of the action reward and the penalized length of the corresponding velocity. With a small and adaptively adjusted probability, the agent instead explores and performs the action that maximizes the wUCB score.
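The decision rule described above can be sketched as follows. This is not Algorithm 5: the form of the linear combination is assumed to be a convex mix with a hypothetical weight `beta`, the test for the winning state is left to the caller, and the adaptive adjustment of the exploration probability `epsilon` is omitted.

```python
import random


def select_action(goal_action, winning, rewards, pen_lengths, wucb,
                  beta, epsilon, rng=random):
    """Win-stay, lose-shift with epsilon-greedy exploration (sketch)."""
    if winning:
        # Win-stay: keep the goal-oriented action.
        return goal_action
    if rng.random() < epsilon:
        # Explore: take the action suggested by the wUCB scores.
        return max(wucb, key=wucb.get)
    # Exploit: maximize the assumed mix of reward and penalized length.
    return max(rewards,
               key=lambda a: beta * rewards[a] + (1.0 - beta) * pen_lengths[a])


if __name__ == "__main__":
    rewards = {0: 1.0, 1: 5.0}
    pen_lengths = {0: 1.0, 1: 0.5}
    wucb = {0: 9.0, 1: 1.0}
    print(select_action(0, False, rewards, pen_lengths, wucb, 0.5, 0.1))
```

Passing `rng` explicitly makes the random choice reproducible, for example with `random.Random(0)`.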
In this section, we present the results of running experiments with LAC-Nav and LAC-Learn, on a computer with an Intel Core i7-6700 CPU ( GHz). The simulations are implemented in Python , and the update processes of the individual agents have been sped up by applying the multitasking scheme. For one second, there are updates performed for each agent, and therefore we set in the implementation of LAC-Nav and LAC-Learn.
For the experiments, we considered three scenarios (Figure 2): the reflection scenario, the circle scenario and the crowd scenario, where
in the reflection scenario, two groups of agents start from the left side and right side of the area, respectively (Figure 2(a)). For each agent, the target is the position on the other side that is symmetric to its start position (Figure 2(d)). Through navigating the agents to the target positions, the picture of the starting configuration is reflected.
For each of these three scenarios, we compare the performances of LAC-Nav and LAC-Learn, with the performances of approaches including BVC [Zhou2017], CNav [Godoy2016], ALAN [Godoy2015] and ORCA [Berg2011].
In this work, we consider three measurements, namely the completion time, the average detour-distance ratio and the average detour-time ratio, to evaluate the algorithm’s performance on the multiagent navigation tasks.
the completion time of running a navigation algorithm is defined as the time (in seconds) when the last agent arrives at its target, assuming all the agents start from time ;
the average detour-distance ratio is defined as the average of the ratios between the actual travel distance and the optimal travel distance (i.e. the length of the straight line from the start position to the target position), over all the agents.
the average detour-time ratio is defined as the average of the ratios between the actual travel time and the optimal travel time (i.e. the time of moving in a straight line from the start position to the target position, at the maximum speed), over all the agents.
While the completion time reflects the algorithm’s global performance in finishing the navigation tasks, the detour-distance/time ratios provide a view on how the behavior of the individual agents varies with different algorithms.
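The three measurements can be computed from recorded trajectories as in the following sketch. The data layout is an assumption: `paths` holds one list of waypoints per agent from its start position to its target, `arrival_times` holds the arrival time of each agent with all agents starting at time zero, and start and target are assumed to be distinct. The function names are hypothetical.

```python
import math


def completion_time(arrival_times):
    """Time at which the last agent arrives at its target."""
    return max(arrival_times)


def detour_distance_ratio(paths):
    """Average of actual travel distance over straight-line distance."""
    ratios = []
    for path in paths:
        travelled = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
        ratios.append(travelled / math.dist(path[0], path[-1]))
    return sum(ratios) / len(ratios)


def detour_time_ratio(paths, arrival_times, max_speed):
    """Average of actual travel time over straight-line time at max speed."""
    ratios = [
        t / (math.dist(path[0], path[-1]) / max_speed)
        for path, t in zip(paths, arrival_times)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    paths = [[(0, 0), (3, 4)], [(0, 0), (3, 0), (3, 4)]]
    times = [5.0, 10.0]
    print(completion_time(times),
          detour_distance_ratio(paths),
          detour_time_ratio(paths, times, 1.0))
```

A ratio of 1 means the agent moved along the straight line (or, for the time ratio, at maximum speed along it); larger values indicate detours or waiting.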
In the experiments for all scenarios, the agent’s radius is uniformly set as , and the maximum moving speed is set as . In addition, as mentioned in the beginning of this section, within each second there are updates performed for each of the agents, which implies that the time interval between two consecutive updates is , i.e. in all the experiments.
Recall that when calculating the local action cells, the hyper-parameter is needed to locate the safe half-planes. Through all experiments involving the local action cells, we set . In addition, the penalty factor is also needed in the calculation of the local action cells (thus it is required when running LAC-Nav and LAC-Learn). Through the experiments, we set , with which the goal-orthogonal action (with angle from the direction pointing to the goal) is penalized by , and the goal-opposite action (along the direction leaving the goal) is penalized by about .
For LAC-Learn, we set the mixing factor (Line of Algorithm 5) as for the reflection scenario and the circle scenario, and for the crowd scenario, and the length of the moving time window (for the calculation of wUCB) as , which is the minimum choice as there are actions in the used . Furthermore, the incremental step (for adjusting the exploration probability) is set as (Line of Algorithm 5).
For ORCA, the collision-free time window is set as , i.e. twice the length of the update interval. Recall that with CNav and ALAN, ORCA is also called to make sure the performed velocity is collision-free, where the time window is also set as . Notice that the time window in ORCA has a different meaning from the hyper-parameter using the same symbol in the calculation of the local action cells, even though they are both related to the avoidance of potential collisions.
For CNav, the hyper-parameter for mixing the goal-oriented reward and the constrained-reduction reward is set as for the reflection scenario, and for the circle scenario and the crowd scenario; the number of constrained neighbors of which the action’s effect is estimated is set as ; the number of the neighbor-based actions is set as .
For ALAN, the hyper-parameter for mixing the goal-oriented reward and the politeness reward is set as ; the length of the moving time window for the calculation of wUCB is set as ; and the incremental step for adjusting the exploration probability is set as .
Figure 3 shows the experiment results for
the reflection scenario of agents ( agents on each side);
the circle scenario of agents (in circles around the same center point);
the crowd scenario of agents located in the area of size .
Overall, LAC-Nav and LAC-Learn outperform almost all the other approaches in the completion time. The only exception is the reflection scenario, in which BVC shows an advantage and completes earlier than LAC-Learn.
In general, the efficiency of the LAC-based approaches is due to the fact that they consider both the task of arriving at the target and the intention to move as much as possible in every step. The latter consideration prevents the agent from unnecessary halting before it arrives at the target. By maintaining and comparing the penalized lengths of all the candidate actions (according to ), even when the agent still has the chance to move directly towards the target, it detours as soon as another action provides a better (penalized) velocity. As shown in the reflection scenario, this kind of active detouring results in a more fluent navigation as the agents (moving in opposite directions) pass by each other (Figure 4).
Recall that according to the definition of the safe half-planes, the local action cell shrinks if there are neighbors approaching. Therefore, with the same relative position, it is easier for an agent to “follow” a neighbor that is moving away, if they have similar preferred trajectories. As a consequence, in the cases with more conflicts, such as the circle scenario, after gathering around the central area, instead of squeezing through (as happens with the other approaches), the agents with the LAC-based approaches spin as a whole to resolve the conflicts (Figure 5).
Although the local action cell can be seen as a variant or extension of the buffered Voronoi cell, it should be noticed that the LAC-based approaches behave quite differently from BVC, except in simple situations such as the reflection scenario. In the more crowded situations (like the circle scenario and the crowd scenario), the individual agents with LAC-Nav or LAC-Learn spend more time on average, while on the other hand the global completion time is shorter. By inspecting the experiment processes, it can be found that the LAC-based approaches caused fewer stuck agents than the other approaches did. This fact is also revealed by checking the completion time of the first arrivals (Figure 6), in which the approaches’ performances are less distinguishable.
In this work, we introduced the definition of the local action cells, and proposed two approaches, LAC-Nav and LAC-Learn, whose efficiency in the completion time has been experimentally demonstrated. In order to improve the approaches’ performance, besides trying different parameter values, there are some natural directions that extend the proposed approaches or lead to variants.
Recall that in the definition of the safe half-plane, we have set the relax factor as . Intuitively, indicates how much the agents are willing to compromise in the next move, in order to avoid the collisions that may happen in the near future. Although it is valid to select any value from the theoretical point of view, it should be noted that a very small may shrink the local action cell too much, and a very large may help little for the long-sighted consideration. The value is a balanced choice, and it also follows from an important idea in reciprocal collision avoidance: each agent takes half of the responsibility for avoiding the coming collisions. However, it would be more interesting if could be adjusted dynamically as the agents learn more about the environment.
In the candidate set used throughout this paper, the angles between any consecutive actions are uniform. While it is natural to use another uniform candidate set of a different size, say of size , or , it is also valid to include actions between which the angles are arbitrary, such as the neighbor-based actions considered in CNav. In order to prevent the actions from being penalized too much, it would be better to set close to or to bound the maximum penalty, when the size of becomes large.
In this paper, we defined the local action cells as sets of finite number of actions. However, it may be more natural to consider the continuous area spanned by the velocities in a cell. There is a direct way to extend the definition of the local action cell to include all the linear combinations between every pair of the adjacent velocities. Formally speaking, we can define,
which is a continuous area in the velocity space. With the continuous version of the local action cell, the agents are no longer restricted to select the actions from , and they can move in any angle as long as the corresponding velocity has a positive length.
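One way to read the continuous extension is sketched below, under the assumption that the "linear combinations" between two adjacent velocities are convex combinations, i.e. points on the segment joining them. The function name `blend` and the weight `lam` are hypothetical. Since an intersection of half-planes is convex, any such point stays inside every half-plane that contains both endpoints.

```python
def blend(v_a, v_b, lam):
    """Convex combination lam * v_a + (1 - lam) * v_b of two 2D velocities."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (lam * v_a[0] + (1.0 - lam) * v_b[0],
            lam * v_a[1] + (1.0 - lam) * v_b[1])


if __name__ == "__main__":
    # A velocity halfway between two adjacent candidates of the cell.
    print(blend((1.0, 0.0), (0.0, 1.0), 0.5))
```

The blended velocity is never longer than the longer of the two endpoints, so a maximum-speed bound satisfied by both candidates is preserved as well.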