# Learning Vision-based Cohesive Flight in Drone Swarms

Fabian Schilling, Julien Lecoeur, Fabrizio Schiano, and Dario Floreano
Laboratory of Intelligent Systems
École Polytechnique Fédérale de Lausanne
CH-1015 Lausanne, Switzerland
{fabian.schilling,julien.lecoeur,fabrizio.schiano,dario.floreano}@epfl.ch
###### Abstract

This paper presents a data-driven approach to learning vision-based collective behavior from a simple flocking algorithm. We simulate a swarm of quadrotor drones and formulate the controller as a regression problem in which we generate 3D velocity commands directly from raw camera images. The dataset is created by simultaneously acquiring omnidirectional images and computing the corresponding control command from the flocking algorithm. We show that a convolutional neural network trained on the visual inputs of the drone can learn not only robust collision avoidance but also coherence of the flock in a sample-efficient manner. The neural controller effectively learns to localize other agents in the visual input, which we show by visualizing the regions with the most influence on the motion of an agent. This weakly supervised saliency map can be computed efficiently and may be used as a prior for subsequent detection and relative localization of other agents. We remove the dependence on sharing positions among flock members by taking only local visual information into account for control. Our work can therefore be seen as the first step towards a fully decentralized, vision-based flock without the need for communication or visual markers to aid detection of other agents.

## 1 Introduction

Collective motion of animal groups such as flocks of birds is an awe-inspiring natural phenomenon that has profound implications for the field of aerial swarm robotics (??). Animal groups in nature operate in a completely self-organized manner since the interactions between them are purely local and decisions are made by the animals themselves. By taking inspiration from decentralization in biological systems, we can develop powerful robotic swarms that are 1) robust to failure, and 2) highly scalable since the number of agents can be increased or decreased depending on the workload.

One of the most appealing characteristics of collective animal behavior for robotics is that decisions are made based on local information such as visual perception. As of today, however, most multi-agent robotic systems rely on entirely centralized control (????) or wireless communication of positions (???), either from a motion capture system or global navigation satellite system (GNSS). The main drawbacks of these approaches are, respectively, the introduction of a single point of failure and the reliance on unreliable data links. Relying on centralized control bears a significant risk since the agents lack the autonomy to make their own decisions in failure cases such as a communication outage. The possibility of failure is even higher in dense urban environments, where GNSS measurements are often unreliable and imprecise.

Vision is arguably the most promising sensory modality to achieve a maximum level of autonomy for robotic systems, particularly considering the recent advances in computer vision and deep learning (???). Apart from being lightweight and having relatively low power consumption, even cheap commodity cameras provide an unparalleled information density with respect to sensors of similar cost. These characteristics are particularly desirable for the deployment of an aerial multi-robot system. The difficulty when using cameras for robot control is the interpretation of the visual information, a hard problem that this paper addresses directly.

In this work, we propose a reactive control strategy based only on local visual information. We formulate the swarm interactions as a regression problem in which we predict control commands as a nonlinear function of the visual input of a single agent. To the best of our knowledge, this is the first successful attempt to learn vision-based swarm behaviors such as collision-free navigation in an end-to-end manner directly from raw images.

## 2 Related Work

We classify the related work into three main categories. Sec. 2.1 considers literature in which a flock of drones is controlled in a fully decentralized manner. Sec. 2.2 covers recent data-driven advances in vision-based drone control. Finally, Sec. 2.3 combines ideas from the previous sections into approaches that are both vision-based and decentralized.

### 2.1 Decentralized flocking with drones

Flocks of autonomous drones such as quadrotors and fixed-wings are the focus of recent research in swarm robotics. Early work presents ten fixed-wing drones deployed in an outdoor environment (?). Their collective motion is based on Reynolds flocking (?) with a migration term that allows the flock to navigate towards the desired goal. Arising from the use of a nonholonomic platform, the authors study the interplay of the communication range and the maximum turning rate of the agents.

Thus far, the largest decentralized quadrotor flock consisted of 30 autonomous agents flying in an outdoor environment (?). The underlying algorithm has many free parameters which require the use of optimization methods. To this end, the authors employ an evolutionary algorithm to find the best flocking parameters according to a fitness function that relies on several order parameters. The swarm can operate in pre-defined confined space by incorporating repulsive virtual agents.

What all of the approaches mentioned above, and others such as (??), have in common is that flock members share their GNSS positions wirelessly. However, there are many situations in which wireless communication is unreliable or GNSS positions are too imprecise. We may not be able to tolerate position imprecisions in situations where the environment requires a small inter-agent distance, for example when traversing narrow passages in urban environments. In these situations, tall buildings may deflect the signal, and communication outages may occur because the wireless bands are over-utilized.

### 2.2 Vision-based single drone control

Vision-based control of a single flying robot is facilitated by several recent advances in the field of machine learning. In particular, the controllers are based on three types of learning methods: imitation learning, supervised learning, and reinforcement learning.

Imitation learning is used in (?) to control a drone in a forest environment based on human pilot demonstrations. The authors motivate the importance of following suboptimal control policies in order to cover more of the state space. The reactive controller can avoid trees by adapting the heading of the drone; the limiting factor is ultimately the field of view of a single front-facing camera.

A supervised learning approach (?) features a convolutional network that is used to predict a steering angle and a collision probability for drone navigation in urban environments. Steering angle prediction is formulated as a regression problem by minimizing the mean-squared-error between predicted and ground truth annotated examples from a dataset geared for autonomous driving research. The probability of collision is learned by minimizing the binary cross-entropy of labeled images collected while riding a bicycle through urban environments. The drone is controlled directly by the steering angle, whereas its forward velocity is modulated by the collision probability.

An approach based on reinforcement learning (?) shows that a neural network trained entirely in a simulated environment can generalize to flights in the real world. In contrast with the previous methods based only on supervised learning, the authors additionally employ a reinforcement learning approach to derive a robust control policy from simulated data. A 3D modeling suite is used to render various hallway configurations with randomized visual conditions. Surprisingly, the control policy trained entirely in the simulated environment is able to navigate real-world corridors and rarely leads to collisions.

The work described above and other similar methods, for instance, (???), use a data-driven approach to control a flying robot in real-world environments. A shortcoming of these methods is that the learned controllers operate only in two-dimensional space which bears similar characteristics to navigation with ground robots. Moreover, the approaches do not show the ability of the controllers to coordinate a multi-agent system.

### 2.3 Vision-based multi-drone control

The control of multiple agents based on visual inputs is achieved with relative localization techniques (?) for a group of three quadrotors. Each agent is equipped with a camera and a circular marker that enables the detection of other agents and the estimation of relative distance. The system relies only on local information obtained from the onboard cameras in near real-time.

Thus far, decentralized vision-based drone control has been realized by mounting visual markers on the drones (???). Although this simplifies the relative localization problem significantly, the marker-based approach is not desirable for real-world deployment of flying robots. The visual markers used are relatively large and bulky, which adds unnecessary weight and drag to the platform; this is especially detrimental in real-world conditions.

## 3 Method

At the core of our method lies the prediction of a velocity command for each agent that matches the velocity command computed by a flocking algorithm as closely as possible. For the remainder of the section, we consider the velocity command from the flocking algorithm as the target for a supervised learning problem. The fundamental idea is to eliminate the dependence on the knowledge of the positions of other agents by processing only local visual information. Fig. 2 provides an overview of our method.

### 3.1 Flocking algorithm

We use an adaptation of Reynolds flocking (?) to generate targets for our learning algorithm. In particular, we only consider the collision avoidance and flock centering terms from the original formulation since they only depend on relative positions. We omit the velocity matching term since estimating the velocities of other agents is an extremely difficult task given only a single snapshot in time (see Sec. 3.3). One would have to rely on estimating the orientation and heading with relatively high precision in order to infer velocities from a single image.

In our formulation of the flocking algorithm, we use the terms separation and cohesion to denote collision avoidance and flock centering, respectively (?). We further add an optional migration term that enables the agents to navigate towards a goal.

The first consideration when modeling the desired behavior of the flock is the notion of neighbor selection. It is reasonable to assume that each agent can only perceive its neighbors within a limited range. We therefore only consider agents as neighbors if they are closer than the desired cutoff distance $r^{max}$, which corresponds to only selecting agents in a sphere with a given radius. We denote the set of neighbors of agent $i$ as

$$\mathcal{N}_i = \{\text{agents } j : j \neq i \wedge \|\mathbf{r}_{ij}\| < r^{max}\} \tag{1}$$

where $\mathbf{r}_{ij}$ denotes the relative position of agent $j$ with respect to agent $i$ and $\|\cdot\|$ the Euclidean norm. We compute $\mathbf{r}_{ij} = \mathbf{p}_j - \mathbf{p}_i$, where $\mathbf{p}_i$ denotes the absolute position of agent $i$.

The separation term steers an agent away from its neighbors in order to avoid collisions. The separation velocity command for the $i$th agent can thus be formalized as

$$\mathbf{v}_i^{sep} = -\frac{k^{sep}}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \frac{\mathbf{r}_{ij}}{\|\mathbf{r}_{ij}\|^2} \tag{2}$$

where $k^{sep}$ is the separation gain which modulates the strength of the separation between agents.

The cohesion term can be seen as the antagonistic inverse of the separation term since its purpose is to steer an agent towards its neighbors to provide cohesiveness to the group. The cohesion velocity command for the $i$th agent can be written as

$$\mathbf{v}_i^{coh} = \frac{k^{coh}}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \mathbf{r}_{ij} \tag{3}$$

where $k^{coh}$ is called the cohesion gain and modulates the tendency for the agents to be drawn towards the center of the neighboring agents.

For our implementation, the separation and cohesion terms are sufficient to generate a collision-free flock in which agents remain together, given that the separation and cohesion gains are chosen carefully. We denote the combination of the two terms as the Reynolds velocity command which is later predicted by the neural network.
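The separation and cohesion terms can be sketched in a few lines of dependency-free Python. This is a minimal illustration of Eqs. (2) and (3), not the authors' implementation: the function name `reynolds_command`, the helper names, and the choice to return a zero command when an agent has no neighbors are assumptions made here for the example.

```python
import math


def sub(a, b):
    """Component-wise difference a - b of two 3D vectors."""
    return tuple(x - y for x, y in zip(a, b))


def norm(v):
    """Euclidean norm of a 3D vector."""
    return math.sqrt(sum(x * x for x in v))


def reynolds_command(i, positions, k_sep, k_coh, r_max):
    """Separation plus cohesion command for agent i (Eqs. 2 and 3)."""
    p_i = positions[i]
    # Relative positions r_ij = p_j - p_i of all neighbors inside the cutoff.
    neighbors = [
        sub(p_j, p_i)
        for j, p_j in enumerate(positions)
        if j != i and norm(sub(p_j, p_i)) < r_max
    ]
    if not neighbors:
        # Assumption: an agent without neighbors receives a zero command.
        return (0.0, 0.0, 0.0)
    n = len(neighbors)
    # Separation points away from neighbors, weighted by inverse distance.
    v_sep = tuple(
        -k_sep / n * sum(r[d] / norm(r) ** 2 for r in neighbors)
        for d in range(3)
    )
    # Cohesion points towards the centroid of the neighbors.
    v_coh = tuple(k_coh / n * sum(r[d] for r in neighbors) for d in range(3))
    return tuple(s + c for s, c in zip(v_sep, v_coh))
```

For two agents, the two terms cancel when the inter-agent distance equals $\sqrt{k^{sep}/k^{coh}}$, which gives some intuition for why the gains have to be chosen carefully: their ratio sets the preferred spacing.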

Moreover, the addition of the migration term provides the possibility to give a uniform navigation goal to all agents. The corresponding migration velocity command is given by

$$\mathbf{v}_i^{mig} = k^{mig} \frac{\mathbf{r}_i^{mig}}{\|\mathbf{r}_i^{mig}\|} \tag{4}$$

where $k^{mig}$ denotes the migration gain and $\mathbf{r}_i^{mig}$ denotes the relative position of the migration point with respect to agent $i$. We compute $\mathbf{r}_i^{mig} = \mathbf{p}^{mig} - \mathbf{p}_i$, where $\mathbf{p}^{mig}$ is the absolute position of the migration point.

The velocity command for an agent is computed as the sum of the Reynolds terms (the combination of separation and cohesion) and the migration term, as $\tilde{\mathbf{v}}_i = \mathbf{v}_i^{sep} + \mathbf{v}_i^{coh} + \mathbf{v}_i^{mig}$. In general, we assume a homogeneous flock, which means that all agents are given the same gains for separation, cohesion, and migration.

A final parameter to adjust the behavior of the flock is the cutoff of the maximum speed. The final velocity command that steers an agent is given by

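$$\mathbf{v}_i = \frac{\tilde{\mathbf{v}}_i}{\|\tilde{\mathbf{v}}_i\|} \min\left(\|\tilde{\mathbf{v}}_i\|, v^{max}\right) \tag{5}$$

where $v^{max}$ denotes the desired maximum speed of an agent during flocking.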
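As an illustration of Eqs. (4) and (5), the following sketch computes the migration command and applies the speed cutoff. The function names and the handling of zero-length vectors (which the equations leave undefined) are assumptions introduced for this example.

```python
import math


def norm(v):
    """Euclidean norm of a 3D vector."""
    return math.sqrt(sum(x * x for x in v))


def migration_command(p_i, p_mig, k_mig):
    """Constant-magnitude command pointing from agent i to the goal (Eq. 4)."""
    r_mig = tuple(m - p for m, p in zip(p_mig, p_i))
    distance = norm(r_mig)
    if distance == 0.0:
        # Assumption: no migration command once the agent sits on the goal.
        return (0.0, 0.0, 0.0)
    return tuple(k_mig * x / distance for x in r_mig)


def limit_speed(v, v_max):
    """Keep the direction of v but cap its magnitude at v_max (Eq. 5)."""
    speed = norm(v)
    if speed == 0.0:
        # Assumption: a zero command is passed through unchanged.
        return v
    scale = min(speed, v_max) / speed
    return tuple(scale * x for x in v)
```

The cutoff only rescales commands whose magnitude exceeds the maximum speed; slower commands pass through unchanged, so the direction computed by the flocking terms is always preserved.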

### 3.2 Drone model

Simulations were performed in Gazebo with a group of nine quadrotor drones. Each drone is equipped with six simulated cameras which provide omnidirectional vision. The cameras are positioned away from the center of gravity of the drone in order to have an unobstructed view of the surrounding environment, including the propellers (see Fig. 2(a)).

Each camera has a horizontal and vertical field of view and takes a grayscale image of pixels with a refresh rate of Hz. We concatenate the images from all six cameras to a pixels grayscale image.

We use the PX4 autopilot (?) and provide it with velocity commands as raw setpoints via a ROS node. The heading of the drone is always aligned with the given velocity command.

### 3.3 Dataset generation

We generate the dataset used for training the regressor entirely in a physics-based simulation environment. Since our objective is to cover the maximum possible state and command space encountered during flocking, we generate our dataset with random trajectories as opposed to trajectories generated by the flocking algorithm itself. In other words, we acquire an image and compute a ground truth velocity command from our flocking algorithm while following a random linear trajectory. This was explicitly done to ensure that the dataset contains agent states which the flocking algorithm would not generate in order to improve robustness to unseen agent configurations. Note that initial experiments with agents following trajectories generated by the flocking algorithm resulted in collisions between agents. This observation is in line with the finding in (?) that situations not encountered during the training phase cannot possibly be learned by the controller. A sample of the dataset is thus a tuple that is acquired for each agent every . The dataset is generated in multiple runs, each of which contains images and ground truth velocity commands generated by following a randomized linear trajectory as described below. We generate samples for training, samples for validation, and samples for testing.

The agents are spawned at random non-overlapping positions around the origin in a cube with side length of and a minimum distance to any other agent . The side length and minimum distance were chosen to resemble a real-world deployment scenario of a drone swarm in a confined environment such as a narrow passage between adjacent buildings. Each agent is then assigned a linear trajectory by following a velocity command which is drawn uniformly inside a unit cone with an angle of . The mean velocity command is thus facing directly in the direction of the migration point as seen from the origin. The velocity command is distinct for each agent and kept constant during the entire run. The random velocity commands were chosen such that collisions and dispersions are encouraged.
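One simple way to obtain random non-overlapping spawn positions is rejection sampling, sketched below. This is an illustrative assumption rather than the procedure used for the dataset: the function name, the cube being centered exactly on the origin, and the retry limit are all hypothetical, and the side length and minimum distance must be supplied by the caller.

```python
import math
import random


def spawn_positions(n_agents, side, d_min, rng, max_tries=10000):
    """Sample n_agents points in an origin-centered cube, at least d_min apart."""
    positions = []
    tries = 0
    while len(positions) < n_agents:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all agents; relax side or d_min")
        candidate = tuple(rng.uniform(-side / 2, side / 2) for _ in range(3))
        # Accept the candidate only if it keeps the minimum distance
        # to every agent that has already been placed.
        if all(math.dist(candidate, p) >= d_min for p in positions):
            positions.append(candidate)
    return positions
```

Passing a seeded `random.Random` instance as `rng` makes the sampled configuration reproducible from run to run.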

A run is considered complete as soon as a) the migration point is reached by at least one agent, b) any pair of agents becomes too close, or c) any pair of agents becomes too dispersed, all while the agents follow their linear trajectories. We consider the migration point reached as soon as an agent comes within a radius of the migration point. We consider the agents too close if the distance between any pair of drones falls below a collision threshold of . Similarly, we regard agents as too dispersed when the distance between any two drones exceeds the dispersion threshold of . The collision threshold follows the constraints of the drone model used in simulation and the dispersion threshold stems from the diminishing size of other agents in the field of view.
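The three termination criteria can be expressed as a single check evaluated at every time step. The sketch below is an assumption about how such a check might look: the function name, the returned labels, and the parameter names `r_goal`, `d_collision`, and `d_dispersion` are hypothetical, and the threshold values are left to the caller.

```python
import math
from itertools import combinations


def run_status(positions, p_mig, r_goal, d_collision, d_dispersion):
    """Classify the current state of a data-collection run."""
    # a) At least one agent is within r_goal of the migration point.
    if any(math.dist(p, p_mig) < r_goal for p in positions):
        return "goal_reached"
    distances = [math.dist(a, b) for a, b in combinations(positions, 2)]
    # b) Some pair of agents is closer than the collision threshold.
    if distances and min(distances) < d_collision:
        return "too_close"
    # c) Some pair of agents is farther apart than the dispersion threshold.
    if distances and max(distances) > d_dispersion:
        return "too_dispersed"
    return "running"
```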

### 3.4 Training phase

We formulate the imitation of the flocking algorithm as a regression problem which takes an image (see Fig. 2(a)) as an input and predicts a velocity command which matches the ground truth velocity command as closely as possible. To produce the desired velocities, we consider a state-of-the-art convolutional neural network (?) that is used for drone navigation. The model is composed of several convolutional layers with ReLU activations (?) and finally a fully connected layer with dropout (?) to avoid overfitting. Unlike (?) we opt for a single-head regression architecture to avoid convergence problems caused by different gradient magnitudes from an additional classification objective during training. This simplifies the optimization problem and the model architecture and thus the resulting controller.

We use mini-batch stochastic gradient descent (SGD) to minimize the regularized mean squared error (MSE) loss between velocity predictions and targets as

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{v}_i - \hat{\mathbf{v}}_i\|^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2 \tag{6}$$

where $\mathbf{v}_i$ is the target velocity and $\hat{\mathbf{v}}_i$ the predicted velocity of the $i$th agent. We denote the mini-batch size by $N$, the weight decay factor by $\lambda$, and the neural network weights (excluding the biases) by $\mathbf{w}$. We employ variance-preserving parameter initialization by drawing the initial weights from a truncated normal distribution according to (?). The biases of the model are initialized to zero.
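To make the objective concrete, the sketch below evaluates a regularized MSE for one mini-batch in plain Python. It assumes the common form in which the penalty is half the weight decay factor times the squared L2 norm of the weights; the exact scaling of the penalty, the function name, and the flat list representation of the weights are assumptions of this example.

```python
def regularized_mse(targets, predictions, weights, weight_decay):
    """Mean squared velocity error over a mini-batch plus an L2 penalty.

    targets, predictions: lists of 3D velocity tuples of equal length.
    weights: flat list of network weights (biases excluded).
    """
    n = len(targets)
    # Squared Euclidean error per sample, averaged over the mini-batch.
    mse = sum(
        sum((t - p) ** 2 for t, p in zip(v, v_hat))
        for v, v_hat in zip(targets, predictions)
    ) / n
    # Assumed penalty form: (weight_decay / 2) * ||w||^2.
    penalty = 0.5 * weight_decay * sum(w * w for w in weights)
    return mse + penalty
```

In a deep learning framework the penalty is usually not written out explicitly but applied through the optimizer's weight decay option, which has the same effect on the gradient up to the scaling convention.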

The objective function is minimized using SGD with momentum (?) and an initial learning rate of which is decayed by a factor of after consecutive epochs without improvement on the hold-out validation set. We train the network using a mini-batch size , a weight decay factor , and a dropout probability of . We stop the training process as soon as the validation loss plateaus for more than ten consecutive epochs.
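The interplay between learning rate decay and early stopping can be sketched as follows. This is one possible reading of the schedule described above, not the authors' training loop: the function name, the parameter names, and the choice to decay again every `lr_patience` stale epochs are assumptions, and the decay factor and patience values must be supplied by the caller.

```python
def learning_rate_history(val_losses, lr, decay_factor, lr_patience,
                          stop_patience=10):
    """Return the learning rate used after each epoch until early stopping."""
    best = float("inf")
    stale = 0  # consecutive epochs without validation improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            # Decay each time another lr_patience stale epochs have passed.
            if stale % lr_patience == 0:
                lr *= decay_factor
        history.append(lr)
        # Stop once the plateau lasts for more than stop_patience epochs.
        if stale > stop_patience:
            break
    return history
```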

The raw images and velocity targets are pre-processed as follows. Before feeding the images into the neural network, we employ global feature standardization such that the entire dataset has a mean of zero and a standard deviation of one. For the velocity targets from the flocking algorithm, we perform a frame transformation from the world frame $W$ into the drone’s body frame $B$ as $\mathbf{v}_i^{B} = \mathbf{R}_i^{BW} \mathbf{v}_i^{W}$, where $\mathbf{R}_i^{BW}$ denotes the rotation matrix from world to body frame for robot $i$ and $\mathbf{v}_i^{W}$ corresponds to the target velocity command. We perform the inverse rotation to transform the predicted velocity commands from the neural network back into the world frame.
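Both pre-processing steps are short enough to sketch directly. The example below assumes, for simplicity, that the body frame differs from the world frame only by a yaw rotation about the vertical axis; a full implementation would use the complete attitude of the drone. All function names are hypothetical.

```python
import math


def standardize(pixels, mean, std):
    """Shift and scale pixel values using dataset-wide statistics."""
    return [(p - mean) / std for p in pixels]


def rotation_world_to_body(yaw):
    """Rotation matrix from world to body frame for a yaw-only attitude."""
    c, s = math.cos(yaw), math.sin(yaw)
    return ((c, s, 0.0), (-s, c, 0.0), (0.0, 0.0, 1.0))


def transpose(m):
    """Transpose of a 3x3 matrix; equals the inverse for rotations."""
    return tuple(tuple(m[r][c] for r in range(3)) for c in range(3))


def rotate(m, v):
    """Apply the 3x3 matrix m to the 3D vector v."""
    return tuple(sum(m[r][k] * v[k] for k in range(3)) for r in range(3))


# Targets go from world to body frame before training ...
r_bw = rotation_world_to_body(math.pi / 2)
v_body = rotate(r_bw, (0.0, 1.0, 0.0))
# ... and predictions are rotated back with the inverse (transpose).
v_world = rotate(transpose(r_bw), v_body)
```

Because a rotation matrix is orthogonal, the inverse transformation needs no matrix inversion; the transpose suffices.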

### 3.5 Vision-based control

Once the convolutional network is finished training, we can use its predictions to modulate the velocity of the drone. The same pre-processing steps apply to the vision-based control scenario, namely the standardization of raw images and the frame transformation of velocity predictions. Although it is not mandatory for the implementation of this algorithm, one can optionally use a low-pass filter to smooth the velocity predictions, and as a result the trajectories of the agents.
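The optional smoothing step could, for example, be realized with a first-order low-pass filter (exponential smoothing), as sketched below. The filter type, the class name, and the smoothing factor `alpha` are assumptions of this example; the text above does not specify which filter would be used.

```python
class LowPassFilter:
    """First-order low-pass filter for 3D velocity commands."""

    def __init__(self, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must lie in (0, 1]")
        self.alpha = alpha  # 1.0 disables smoothing entirely
        self.state = None

    def update(self, v):
        """Blend the new prediction with the previous filtered command."""
        if self.state is None:
            # The first prediction initializes the filter state.
            self.state = tuple(v)
        else:
            self.state = tuple(
                self.alpha * x + (1.0 - self.alpha) * s
                for x, s in zip(v, self.state)
            )
        return self.state
```

A smaller `alpha` yields smoother trajectories at the cost of a slower reaction to sudden changes in the predictions, which matters for a reactive collision avoidance controller.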

## 4 Results

This section evaluates the learned vision-based swarm controller by comparing it to the behavior of the position-based flocking algorithm. The results show that the proposed controller represents a robust alternative to communication-based systems in which agents share their positions with the other members of the group. We refer to the swarm operating on only visual inputs as vision-based and the swarm operating on shared positions as position-based.

The experiments are performed using the Gazebo simulator (?) in combination with the PX4 autopilot (?) for state estimation and control. We employ the same set of flocking parameters used during the training phase throughout the following experiments (see Tab. 1). Unless otherwise stated in the text, those parameters remain constant for the remainder of the experimental analysis. The neural network is implemented in PyTorch (?). All simulations use the same random seed for repeatability.

### 4.1 Flocking metrics

We report our results in terms of three flocking metrics that describe the state of the swarm at a given time step. The measures are either distance-based or alignment-based.

The two most important metrics are the minimum and maximum inter-agent distance within the entire flock

$$d^{min} = \min_{i,j \in \mathcal{A},\, i < j} \|\mathbf{r}_{ij}\| \quad \text{and} \quad d^{max} = \max_{i,j \in \mathcal{A},\, i < j} \|\mathbf{r}_{ij}\| \tag{7}$$

where we let $\mathcal{A}$ denote the set of all agents, and we have $\|\mathbf{r}_{ij}\| = \|\mathbf{r}_{ji}\|$ because of symmetry. The minimum and maximum inter-agent distance are direct indicators for successful collision avoidance, as well as general segregation of the flock, respectively. For instance, a collision occurs if the distance between any pair of agents falls below a threshold. Similarly, we consider the flock too dispersed if the pairwise distance becomes too large.
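Both distance metrics follow directly from the pairwise distances between agents. A minimal sketch, with a hypothetical function name, iterates over each unordered pair once, which is sufficient because the distance is symmetric:

```python
import math
from itertools import combinations


def inter_agent_extremes(positions):
    """Return (d_min, d_max) over all unordered pairs of agents."""
    distances = [math.dist(a, b) for a, b in combinations(positions, 2)]
    if not distances:
        raise ValueError("need at least two agents")
    return min(distances), max(distances)
```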

We also measure the overall alignment of the flock using an order parameter based on the cosine similarity

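$$\omega^{ord} = \frac{1}{|\mathcal{A}|(|\mathcal{A}| - 1)} \sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{A},\, j \neq i} \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\|\mathbf{v}_i\| \|\mathbf{v}_j\|} \tag{8}$$

which measures the degree to which the headings of the agents agree with each other (?). If the headings are aligned, we have $\omega^{ord} \approx 1$ and in a disordered state, we have $\omega^{ord} \approx 0$. Recall that the agent’s body frame is always aligned with the direction of motion.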
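An order parameter of this kind can be sketched as the cosine similarity averaged over all ordered pairs of distinct agents. The normalization by the number of pairs, the function name, and the requirement that all velocities are nonzero are assumptions of this example.

```python
import math


def order_parameter(velocities):
    """Mean pairwise cosine similarity of the agents' velocity vectors."""
    n = len(velocities)
    if n < 2:
        raise ValueError("need at least two agents")
    total = 0.0
    for i, v_i in enumerate(velocities):
        for j, v_j in enumerate(velocities):
            if i == j:
                continue
            dot = sum(a * b for a, b in zip(v_i, v_j))
            # Assumes nonzero velocities; a hovering agent has no heading.
            total += dot / (math.hypot(*v_i) * math.hypot(*v_j))
    return total / (n * (n - 1))
```

With this normalization, identical headings give a value of one, mutually perpendicular headings give zero, and two exactly opposing headings give minus one.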

### 4.2 Common migration goal

In the first experiment, we give all agents the same migration goal and show that the swarm remains collision-free during navigation. Both the vision-based and the position-based swarm exhibit remarkably similar behavior while migrating (see Figs. 3(a) and 3(c)). For the vision-based controller, one should notice that the velocity commands predicted by the neural network are sent to the agents in their raw form without any further processing.

We are especially interested in the minimum and maximum inter-agent distances, i.e. the extreme distances between any pair of agents, during migration. The minimum distance can be used as a direct measure for collision avoidance, whereas the maximum distance is a helpful metric when deciding whether or not a swarm is coherent. The vision-based controller matches the position-based one very well since they do not deviate significantly over the course of the entire trajectory (see Fig. 3(e)). If the neural controller had not learned to keep a minimum inter-agent distance, we would observe collisions in this case.

### 4.3 Opposing migration goals

In this experiment, we assign different migration goals to two subsets of agents. The first group, consisting of five agents, is assigned the same waypoint as in Sec. 4.2. The second group, consisting of the remaining four agents, is assigned a migration point on the opposite side with respect to the first group. The position-based and vision-based flock exhibit very similar migration behaviors (see Figs. 3(b) and 3(d)). In both cases, the swarm cohesion is strong enough to keep the agents together despite the diverging navigational preferences. This is the first sign that the neural network learns the non-trivial difference between agents that are too close or too far away.

We can observe slightly different behaviors between the two modalities when measuring the alignment within the flock. The position-based flock tends to be ordered to a greater extent than its vision-based counterpart, and its order varies periodically (see Fig. 3(h)). This periodicity stems from the circular motion exhibited by the position-based agents (see Fig. 3(b)). There is less regularity in the vision-based flock, in which agents tend to be less predictable in their trajectories while remaining collision-free and coherent. Note that the vision-based flock also reaches its migration goal far later than the position-based flock.

### 4.4 Generalization of the neural controller

We performed a series of ablation studies to show that the learned controller generalizes to previously unseen scenarios. These experiments can be seen as perturbations to the conditions that the agents were exposed to during the training phase. This allows us to highlight failure cases of the controller as well as show its robustness towards changing conditions. We change parameters such as the number of agents in the swarm or the maximum flock speed $v^{max}$. We also remove the external migration point and the corresponding term entirely to show how the flock behaves when it is self-propelled.

The swarm remains collision-free and coherent without a migration point and when the number of agents or the flock speed is changed. We did not observe a noticeable difference in the inter-agent distance or the order parameter during the experiments.

Since the vision-based controller provides a very tight coupling between perception and control, the need for interpretation of the learned behavior arises. To this end, we employ a state-of-the-art attribution method (?), which shows how much influence each pixel in the input image has on the predicted velocity command (see Fig. 5). More specifically, we compute the gradients for the heat map by backpropagating with respect to the penultimate strided convolutional layer of the neural network in which the individual feature maps retain a spatial size of pixels. We then employ bilinear upsampling to increase the resolution of the resulting saliency map before we blend it with the original input image. We attribute the relatively poor localization performance of some of the agents to the low spatial acuity of the generated heat map.

We can observe a non-uniform distribution of importance that seems to concentrate on a single agent that is located in the field of view of the front-facing camera (see Fig. 2(a)). The image is taken from an early time step during migration where the magnitudes of the predicted velocity commands are relatively large. We notice that the network is effectively localizing the other agents spatially in the visual input, albeit having not been explicitly given the positions as targets. The saliency map is generated very efficiently by computing the backward pass until the target convolutional layer and could therefore serve as a valuable input to a real-time detection algorithm.

## 5 Conclusions and Future Work

This paper presented a machine learning approach to the problem of collision-free and coherent motion of a dense swarm of quadcopters. The agents learn to coordinate themselves only via visual inputs in 3D space by mimicking a flocking algorithm. The learned controller removes the need for communication of positions among agents and thus presents the first step towards a fully decentralized vision-based swarm of drones. The trajectories of the flock are relatively smooth even though the controller is based on raw neural network predictions. We show that our method is robust to perturbations such as changes in the number of agents or the maximum speed of the flock. Our algorithm naturally handles navigation tasks by adding a migration term to the predicted velocity of the neural controller.

We are actively working on transferring the flock of quadrotors from a simulated environment into the real world. A motion capture system is used to generate the dataset of ground truth positions and real camera images, similar to the simulation setup described in this paper. A natural subsequent step will be the transfer of the learned controller to outdoor scenarios where ground truth positions will be obtained using a GNSS. To reduce the need for large amounts of labeled data, we are exploring recent advances in deep domain adaptation (??) to aid generalization of the neural controller to environments with background clutter. Another challenge is the addition of obstacles to the environment in which the agents operate. To this end, we will opt for more sophisticated flocking algorithms which allow the direct modeling of obstacles (??), as well as pre-defining the desired distance between agents (?).

## 6 Acknowledgements

We thank Enrica Soria for the feedback and helpful discussions, as well as Olexandr Gudozhnik and Przemyslaw Kornatowski for their contributions to the drone hardware. This research was supported by the Swiss National Science Foundation (SNF) with grant number 200021_155907 and the Swiss National Center of Competence Research (NCCR).

## References

• [Dousse et al. 2017] Dousse, N.; Heitz, G.; Schill, F.; and Floreano, D. 2017. Human-Comfortable Collision-Free Navigation for Personal Aerial Vehicles. IEEE Robot Autom Lett (RA-L) 2(1):358–365.
• [Faigl et al. 2013] Faigl, J.; Krajník, T.; Chudoba, J.; Přeučil, L.; and Saska, M. 2013. Low-cost embedded system for relative localization in robotic swarms. In Intl Conf Rob Autom (ICRA), 993–998. IEEE.
• [Floreano and Wood 2015] Floreano, D., and Wood, R. J. 2015. Science, technology and the future of small autonomous drones. Nature 521(7553):460–466.
• [Gandhi, Pinto, and Gupta 2017] Gandhi, D.; Pinto, L.; and Gupta, A. 2017. Learning to fly by crashing. In Intl Conf Intell Rob Sys (IROS), 3948–3955. IEEE/RSJ.
• [Ganin et al. 2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial Training of Neural Networks. J Mach Learn Res (JMLR) 17(1):2096–2030.
• [Giusti et al. 2016] Giusti, A.; Guzzi, J.; Cireşan, D. C.; He, F. L.; Rodríguez, J. P.; Fontana, F.; Faessler, M.; Forster, C.; Schmidhuber, J.; Caro, G. D.; Scaramuzza, D.; and Gambardella, L. M. 2016. A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots. IEEE Robot Autom Lett (RA-L) 1(2):661–667.
• [Hauert et al. 2011] Hauert, S.; Leven, S.; Varga, M.; Ruini, F.; Cangelosi, A.; Zufferey, J.-C.; and Floreano, D. 2011. Reynolds flocking in reality with fixed-wing robots: Communication range vs. maximum turning rate. In Intl Conf Intell Rob Sys (IROS), 5015–5020. IEEE/RSJ.
• [He et al. 2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Intl Conf Comp Vis (ICCV), 1026–1034.
• [He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Conf Comp Vis Pat Rec (CVPR), 770–778. IEEE.
• [Koenig and Howard 2004] Koenig, N., and Howard, A. 2004. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In Intl Conf Intell Rob Sys (IROS), volume 3, 2149–2154. IEEE/RSJ.
• [Krajník et al. 2014] Krajník, T.; Nitsche, M.; Faigl, J.; Vaněk, P.; Saska, M.; Přeučil, L.; Duckett, T.; and Mejail, M. 2014. A Practical Multirobot Localization System. J Intell Robotic Syst 76(3-4):539–562.
• [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Conf Neural Info Proc Sys (NIPS), volume 25, 1097–1105.
• [Kushleyev et al. 2013] Kushleyev, A.; Mellinger, D.; Powers, C.; and Kumar, V. 2013. Towards a swarm of agile micro quadrotors. Auton Robots 35(4):287–300.
• [LeCun, Bengio, and Hinton 2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.
• [Loquercio et al. 2018] Loquercio, A.; Maqueda, A. I.; del Blanco, C. R.; and Scaramuzza, D. 2018. DroNet: Learning to Fly by Driving. IEEE Robot Autom Lett (RA-L) 3(2):1088–1095.
• [Meier, Honegger, and Pollefeys 2015] Meier, L.; Honegger, D.; and Pollefeys, M. 2015. PX4: A node-based multithreaded open source robotics framework for deeply embedded platforms. In Intl Conf Rob Autom (ICRA), 6235–6240. IEEE.
• [Mellinger and Kumar 2011] Mellinger, D., and Kumar, V. 2011. Minimum snap trajectory generation and control for quadrotors. In Intl Conf Rob Autom (ICRA), 2520–2525. IEEE.
• [Olfati-Saber 2006] Olfati-Saber, R. 2006. Flocking for multi-agent dynamic systems: Algorithms and theory. IEEE Trans Automat Contr 51(3):401–420.
• [Paszke et al. 2017] Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; and Lin, Z. 2017. Automatic differentiation in PyTorch. In Conf Neural Info Proc Sys (NIPS),  4.
• [Preiss et al. 2017] Preiss, J. A.; Honig, W.; Sukhatme, G. S.; and Ayanian, N. 2017. Crazyswarm: A large nano-quadcopter swarm. In Intl Conf Rob Autom (ICRA), 3299–3304. IEEE.
• [Reynolds 1987] Reynolds, C. W. 1987. Flocks, Herds and Schools: A Distributed Behavioral Model. In Annual Conf Comp Graph Interactive Technol (SIGGRAPH), volume 14, 25–34. ACM.
• [Ross et al. 2013] Ross, S.; Melik-Barkhudarov, N.; Shankar, K. S.; Wendel, A.; Dey, D.; Bagnell, J. A.; and Hebert, M. 2013. Learning monocular reactive UAV control in cluttered natural environments. In Intl Conf Rob Autom (ICRA), 1765–1772. IEEE.
• [Rozantsev, Salzmann, and Fua 2018] Rozantsev, A.; Salzmann, M.; and Fua, P. 2018. Beyond Sharing Weights for Deep Domain Adaptation. IEEE Trans Pattern Anal Mach Intell (TPAMI) 1–1.
• [Sadeghi and Levine 2017] Sadeghi, F., and Levine, S. 2017. CAD2RL: Real Single-Image Flight without a Single Real Image. In Rob: Sci Sys (RSS), volume 13.
• [Saska et al. 2014] Saska, M.; Chudoba, J.; Přeučil, L.; Thomas, J.; Loianno, G.; Třešňák, A.; Vonásek, V.; and Kumar, V. 2014. Autonomous deployment of swarms of micro-aerial vehicles in cooperative surveillance. In Intl Conf Unmanned Aircraft Sys (ICUAS), 584–595.
• [Saska et al. 2017] Saska, M.; Baca, T.; Thomas, J.; Chudoba, J.; Preucil, L.; Krajnik, T.; Faigl, J.; Loianno, G.; and Kumar, V. 2017. System for deployment of groups of unmanned micro aerial vehicles in GPS-denied environments using onboard visual relative localization. Auton Robots 41(4):919–944.
• [Saska, Vakula, and Přeućil 2014] Saska, M.; Vakula, J.; and Přeučil, L. 2014. Swarms of micro aerial vehicles stabilized under a visual relative localization. In Intl Conf Rob Autom (ICRA), 3570–3575. IEEE.
• [Selvaraju et al. 2017] Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Intl Conf Comp Vis (ICCV), 618–626.
• [Smolyanskiy et al. 2017] Smolyanskiy, N.; Kamenev, A.; Smith, J.; and Birchfield, S. 2017. Toward low-flying autonomous MAV trail navigation using deep neural networks for environmental awareness. In Intl Conf Intell Rob Sys (IROS), 4241–4247. IEEE.
• [Srivastava et al. 2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J Mach Learn Res (JMLR) 15:1929–1958.
• [Sutskever et al. 2013] Sutskever, I.; Martens, J.; Dahl, G.; and Hinton, G. 2013. On the importance of initialization and momentum in deep learning. In Intl Conf Mach Learn (ICML), 1139–1147.
• [Virágh et al. 2014] Virágh, C.; Vásárhelyi, G.; Tarcai, N.; Szörényi, T.; Somorjai, G.; Nepusz, T.; and Vicsek, T. 2014. Flocking algorithm for autonomous flying robots. Bioinspiration Biomim 9(2):025012.
• [Vásárhelyi et al. 2014] Vásárhelyi, G.; Virágh, C.; Somorjai, G.; Tarcai, N.; Szörényi, T.; Nepusz, T.; and Vicsek, T. 2014. Outdoor flocking and formation flight with autonomous aerial robots. In Intl Conf Intell Rob Sys (IROS), 3866–3873. IEEE/RSJ.
• [Vásárhelyi et al. 2018] Vásárhelyi, G.; Virágh, C.; Somorjai, G.; Nepusz, T.; Eiben, A. E.; and Vicsek, T. 2018. Optimized flocking of autonomous drones in confined environments. Science Robot 3(20):eaat3536.
• [Weinstein et al. 2018] Weinstein, A.; Cho, A.; Loianno, G.; and Kumar, V. 2018. Visual Inertial Odometry Swarm: An Autonomous Swarm of Vision-Based Quadrotors. IEEE Robot Autom Lett (RA-L) 3(3):1801–1807.
• [Zufferey et al. 2011] Zufferey, J.-C.; Hauert, S.; Stirling, T.; Leven, S.; Roberts, J.; and Floreano, D. 2011. Aerial Collective Systems. In Kernbach, S., ed., Handbook of Collective Robotics. Pan Stanford. 609–660.