Towards a Robust Aerial Cinematography Platform:
Localizing and Tracking Moving Targets in Unstructured Environments
The use of drones for aerial cinematography has revolutionized several applications and industries that require live and dynamic camera viewpoints such as entertainment, sports, and security. However, safely controlling a drone while filming a moving target usually requires multiple expert human operators; hence the need for an autonomous cinematographer. Current approaches have severe real-life limitations such as requiring fully scripted scenes, high-precision motion-capture systems or GPS tags to localize targets, and prior maps of the environment to avoid obstacles and plan for occlusion.
In this work, we overcome such limitations and propose a complete system for aerial cinematography that combines: (1) a vision-based algorithm for target localization; (2) a real-time incremental 3D signed-distance map algorithm for occlusion and safety computation; and (3) a real-time camera motion planner that optimizes smoothness, collisions, occlusions and artistic guidelines. We evaluate robustness and real-time performance in a series of field experiments and simulations by tracking dynamic targets moving through unknown, unstructured environments. Finally, we verify that despite removing previous limitations, our system achieves state-of-the-art performance.
I Introduction
In this paper, we address the problem of autonomous cinematography using unmanned aerial vehicles (UAVs). Specifically, we focus on scenarios where a UAV must film an actor moving through an unknown environment at high speed, in an unscripted manner. Filming dynamic actors among clutter is extremely challenging, even for experienced pilots: it takes intense attention and effort to simultaneously predict how the scene will evolve, control the UAV, avoid obstacles and reach desired viewpoints. Towards solving this problem, we present a complete system that can autonomously handle the real-life constraints involved in aerial cinematography: tracking the actor, mapping the surrounding terrain and planning maneuvers to capture high-quality, artistic shots.
Consider the typical filming scenario in Fig. 1. The UAV must accomplish a number of tasks. First, it must estimate the actor’s pose using an onboard camera and forecast their future motion. The pose estimation must be robust to changing viewpoints, backgrounds and lighting conditions, and accurate forecasting is key for anticipating events that require changing camera viewpoints. Second, the UAV must remain safe as it flies through new environments; safety requires explicit modelling of environmental uncertainty. Finally, the UAV must capture high-quality video, which requires maximizing a set of artistic guidelines. The key challenge is that all these tasks must run in real time under limited onboard computational resources.
There is a rich history of work in autonomous aerial filming that tackles parts of these challenges. For instance, several works focus on artistic guidelines [joubert2016towards, nageli2017real, galvane2017automated, galvane2018directing] but often rely on perfect actor localization through high-precision RTK GNSS or motion-capture systems. Additionally, while the majority of work in the area handles collisions between the UAV and actors [nageli2017real, joubert2016towards, huang2018act], the environment is not factored in. While there are several successful commercial products, they too are limited to either low-speed, low-clutter regimes (e.g. DJI Mavic [mavic]) or short planning horizons (e.g. Skydio R1 [skydio2018]). Even our previous work [bonatti2018autonomous], despite handling environmental occlusions and collisions, assumes a prior elevation map and uses GPS to localize the actor. Such simplifications restrict the diversity of real-life scenarios that these systems can handle.
We address these challenges by building upon previous work that formulates the problem as an efficient real-time trajectory optimization [bonatti2018autonomous]. In this work we make a key observation: we do not need prior ground-truth information about the scene; our onboard sensors suffice to attain good performance. However, sensor data is noisy and must be processed in real time; we therefore develop robust and efficient algorithms. To localize the actor, we use a visual tracking system. To map the environment, we use a long-range LiDAR and process its returns incrementally to build a signed distance field of the environment. Combining both methods, we can plan over long horizons in unknown environments to film fast, dynamic actors according to artistic guidelines. In summary, our main contributions in this paper are threefold:
We develop an incremental signed distance transform algorithm for large-scale real-time environment mapping (Section IV-B);
We develop a complete system for autonomous cinematography that includes visual actor localization, online mapping, and efficient trajectory optimization that can deal with noisy measurements (Section IV);
We offer extensive quantitative and qualitative performance evaluations of our system both in simulation and field tests, while also comparing performance changes with scenarios with full map and actor knowledge (Section V).
II Problem Formulation
The overall task is to control a UAV to film an actor who is moving through an unknown environment. We formulate this as a trajectory optimization problem where the cost function measures shot quality, environmental occlusion of the actor, jerkiness of motion and safety. This cost function depends on the environment and the actor, both of which must be sensed on-the-fly. The changing nature of the environment and of the actor's trajectory also demands re-planning at high frequency.
Let $\xi_q(t) : [0, t_f] \rightarrow \mathbb{R}^3 \times SO(2)$ be the trajectory of the UAV, i.e., $\xi_q(t) = \{x(t), y(t), z(t), \psi_q(t)\}$. Let $\xi_a(t) : [0, t_f] \rightarrow \mathbb{R}^2 \times SO(2)$ be the trajectory of the actor, $\xi_a(t) = \{x(t), y(t), \psi_a(t)\}$. The state of the actor, as sensed by onboard cameras, is fed into a prediction module that computes $\xi_a$ (Section IV-A).
Let $\mathcal{G} : \mathbb{R}^3 \rightarrow [0, 1]$ be a voxel occupancy grid that maps every point in space to a probability of occupancy. Let $\Phi : \mathbb{R}^3 \rightarrow \mathbb{R}$ be the signed distance value of a point to the nearest obstacle. Positive sign is used for points in free space, and negative sign for points either in occupied or unknown space, which we assume to be potentially inside an obstacle. The UAV senses the environment with the onboard LiDAR, updates $\mathcal{G}$, and then updates $\Phi$ (Section IV-B).
We briefly touch upon the four components of the cost function $J(\xi_q)$ (refer to Section IV-C for the mathematical expressions). The objective is to minimize $J(\xi_q)$, a weighted sum of the four terms below, subject to the initial boundary constraint that $\xi_q(0)$ equals the UAV's current state.
Smoothness $J_{\text{smooth}}$: Penalizes jerky motions that may lead to camera blur and unstable flight;
Shot quality $J_{\text{shot}}$: Penalizes poor viewpoint angles and shot scales that deviate from the artistic guidelines;
Safety $J_{\text{safe}}$: Penalizes proximity to obstacles that are unsafe for the UAV;
Occlusion $J_{\text{occ}}$: Penalizes occlusion of the actor by obstacles in the environment.
The solution $\xi_q^*$ is then tracked by the UAV.
III Related Work
Camera control in virtual cinematography has been extensively examined by the computer graphics community, as reviewed by [christie2008camera]. These methods tend to reason about the utility of a viewpoint in isolation, following artistic principles and composition rules [arijon1976grammar, bowen2013grammar] and employ either optimization-based approaches to find good viewpoints, or reactive approaches to track the virtual actor. The focus is typically on through-the-lens control where a virtual camera is manipulated while maintaining focus on certain image features [gleicher1992through, drucker1994intelligent, lino2011director, lino2015intuitive]. However, virtual cinematography is free of several real-world limitations such as robot physics constraints and assumes full map knowledge.
Autonomous aerial cinematography
Several contributions on aerial cinematography focus on keyframe navigation. [roberts2016generating, joubert2015interactive, gebhardt2018optimizing, gebhardt2016airways, xie2018creating] provide user interface tools for re-timing and connecting static aerial viewpoints into dynamically feasible and visually pleasing trajectories. [lan2017xpose] use key-frames defined on the image itself instead of in world coordinates.
Other works focus on tracking dynamic targets, and employ a diverse set of techniques for actor localization and navigation. For example, [huang2018act, huang2018through] detect the skeleton of targets from visual input, while other approaches rely on off-board actor localization from either motion-capture systems or GPS sensors [joubert2016towards, galvane2017automated, nageli2017real, galvane2018directing, bonatti2018autonomous]. These approaches have varying levels of complexity: [bonatti2018autonomous, galvane2018directing] can avoid obstacles and occlusions caused by both the environment and actors, while other approaches handle only collisions and occlusions caused by actors. We also observe distinct trajectory generation methods, ranging from trajectory optimization to search-based planners. In Table I we summarize the different contributions, also differentiating onboard from off-board computing systems. Notably, prior to our current work, none of the previous approaches provided a solution for online environment mapping.
Online environment mapping
Dealing with imperfect representations of the world becomes a bottleneck for viewpoint optimization in physical environments. As the world is sensed online, it is usually incrementally mapped using voxel occupancy maps [thrun2005probabilistic]. To evaluate a viewpoint, methods typically raycast on such maps, which can be very expensive [isler2016information, Charrow-RSS-15]. Recent advances in mapping have led to better representations that can incrementally compute the truncated signed distance field (TSDF) [newcombe2011kinectfusion, klingensmith2015chisel], i.e. return the distance and gradient to nearest object surface for a query. TSDFs are a suitable abstraction layer for planning approaches and have already been used to efficiently compute collision-free trajectories for UAVs [oleynikova2016voxblox, cover2013sparse].
Visual target state estimation
Accurate object state estimation with monocular cameras is critical to many robot applications. Deep networks have shown success in detecting objects [yolov3, faster_rcnn] and estimating 3D heading [raza2018appearance, li2014] with several efficient architectures developed specifically for mobile applications [howard2017mobilenets, Zhang_2018_CVPR]. However, many models do not generalize well to other tasks (e.g., aerial filming) due to data mismatch in terms of angles and scales. Our recent work in semi-supervised learning shows promise in increasing model generalizability with little labeled data by leveraging temporal continuity in training videos [wang2019heading].
Our work exploits synergies at the confluence of several domains of research to develop an aerial cinematography platform that can follow dynamic targets in unknown and unstructured environments, as detailed next in our approach.
IV Approach
We now detail our approach for each sub-system of the aerial cinematography platform. At a high level, three main sub-systems operate together: (A) Vision, required for localizing the target’s position and orientation and for recording the final UAV footage; (B) Mapping, required for creating an environment representation; and (C) Planning, which combines the actor’s pose and the environment to calculate the UAV trajectory. Fig. 2 shows a system diagram.
IV-A Vision sub-system
We use only monocular images from the UAV’s gimbal and the UAV’s own state estimation to forecast the actor’s trajectory in the world frame. The vision sub-system consists of four main steps: actor bounding box detection and tracking, heading angle estimation, global ray-casting, and finally a filtering stage. Figure 3 summarizes the pipeline.
a) Detection and tracking: Our detection module is based on the MobileNet network architecture, due to its low memory usage and fast inference speed, which are well-suited for real-time applications on an onboard computer. We use the same network structure as detailed in our previous work [wang2019heading]. Our model is further trained with COCO [lin2014microsoft] and fine-tuned on a custom aerial filming dataset. We limit the detection categories to person, car, bicycle, and motorcycle, which commonly appear in aerial filming. After a successful detection we use Kernelized Correlation Filters [henriques2015high] to track the template over the next incoming frames. We actively position the independent camera gimbal with a PD controller to frame the actor on the desired screen position, following the commanded artistic principles (Fig. 5).
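As an illustration of the gimbal positioning step, the PD control law can be sketched as follows. The gains, the normalized screen-coordinate convention, and the sign conventions below are illustrative choices, not the values used on our platform:

```python
# Sketch of a PD gimbal controller (hypothetical gains): the error is the
# offset, in normalized screen coordinates, between the tracked bounding-box
# center and the desired screen position dictated by the shot's composition.

def pd_gimbal_step(bbox_center, desired, prev_error, dt, kp=1.2, kd=0.15):
    """Return (pitch_rate, yaw_rate) commands and the current error tuple."""
    ex = bbox_center[0] - desired[0]   # horizontal offset -> yaw correction
    ey = bbox_center[1] - desired[1]   # vertical offset   -> pitch correction
    dex = (ex - prev_error[0]) / dt
    dey = (ey - prev_error[1]) / dt
    yaw_rate = -(kp * ex + kd * dex)
    pitch_rate = -(kp * ey + kd * dey)
    return (pitch_rate, yaw_rate), (ex, ey)

# Example: actor drifted slightly to the right of the desired framing point,
# so the controller commands a corrective yaw rate.
cmd, err = pd_gimbal_step((0.6, 0.5), (0.5, 0.5), (0.0, 0.0), dt=0.05)
```

Running the loop at the camera frame rate keeps the actor at the commanded screen position as both drone and actor move.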
b) Heading Estimation: Accurate heading angle estimation is vital for the UAV to follow the correct shot type (front, back, left, right). As discussed in [wang2019heading], human 2D pose estimation has been widely studied [toshev2014deeppose, cao2017realtime], but 3D heading direction cannot be trivially recovered from 2D points alone because depth remains undefined. Therefore, we use the model architecture from [wang2019heading] (Fig. 4), which takes a bounding box image as input and outputs the cosine and sine values of the heading angle. This network is trained with a double loss, summing errors in heading direction and in temporal continuity. The latter term is particularly useful for training the regressor on small datasets, following a semi-supervised approach.
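The structure of this training objective can be sketched as below. The relative weighting `lam` and the sparse-label interface are our own illustrative choices; the heading angle itself is recovered from the (cos, sin) output with `atan2`:

```python
import math

# Sketch of the two-term training loss: a supervised heading term on labeled
# frames plus a temporal-continuity term that penalizes abrupt changes between
# consecutive predictions, which enables semi-supervised use of unlabeled frames.

def angle_from_output(c, s):
    """Recover the heading angle from the network's (cos, sin) output."""
    return math.atan2(s, c)

def double_loss(preds, labels, lam=0.5):
    """preds: list of (cos, sin) per frame; labels: {frame_idx: angle}, sparse."""
    heading = sum((c - math.cos(a)) ** 2 + (s - math.sin(a)) ** 2
                  for i, (c, s) in enumerate(preds)
                  if (a := labels.get(i)) is not None)
    continuity = sum((c1 - c0) ** 2 + (s1 - s0) ** 2
                     for (c0, s0), (c1, s1) in zip(preds, preds[1:]))
    return heading + lam * continuity
```

Predicting (cos, sin) rather than the raw angle avoids the wrap-around discontinuity at ±180°.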
c) Ray-casting: Given the actor’s bounding box and heading estimate in image space, we project the center-bottom point of the bounding box onto the world’s ground plane, and transform the actor’s heading using the camera’s state estimation, obtaining the actor’s position and heading in world coordinates.
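Assuming a locally flat ground plane at $z = 0$ and a world-frame ray through the bounding box's center-bottom pixel (obtained from the gimbal's state estimation and the camera intrinsics), the projection reduces to a ray-plane intersection:

```python
def raycast_to_ground(cam_pos, ray_dir):
    """Intersect a world-frame camera ray with the ground plane z = 0.

    cam_pos: (x, y, z) camera center; ray_dir: world-frame direction of the
    ray through the bounding box's center-bottom pixel. Returns the 3D hit
    point, or None if the ray never reaches the ground.
    """
    if ray_dir[2] >= 0:                      # ray points at or above horizon
        return None
    t = -cam_pos[2] / ray_dir[2]             # parameter where z becomes 0
    return tuple(p + t * d for p, d in zip(cam_pos, ray_dir))

# Camera 10 m up, looking 45 degrees down along +x: hits the ground 10 m ahead.
hit = raycast_to_ground((0.0, 0.0, 10.0), (1.0, 0.0, -1.0))
```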
d) Motion Forecasting: The current actor pose updates a Kalman filter (KF) that forecasts the actor’s trajectory $\xi_a(t)$. We use separate KF motion models for people and for vehicles.
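A minimal sketch of the forecasting step, using a single-axis constant-velocity model (the actual filter dimensions and noise parameters for people and vehicles differ; `q` is an illustrative process-noise value):

```python
def kf_predict_step(x, P, dt, q=0.1):
    """One constant-velocity predict step for a single axis.
    x = [position, velocity]; P = 2x2 covariance as a list of lists."""
    x = [x[0] + dt * x[1], x[1]]                       # pos += vel * dt
    F = [[1.0, dt], [0.0, 1.0]]                        # state transition
    FP = [[sum(F[i][k] * P[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    P = [[sum(FP[i][k] * F[j][k] for k in range(2)) + (q if i == j else 0.0)
          for j in range(2)] for i in range(2)]        # P = F P F^T + Q
    return x, P

def forecast(x, P, dt, horizon_steps):
    """Roll the filter's motion model forward to build the actor forecast."""
    traj = []
    for _ in range(horizon_steps):
        x, P = kf_predict_step(x, P, dt)
        traj.append(x[0])
    return traj
```

Between forecasts, the usual KF measurement update corrects the state with each new ray-cast actor position.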
IV-B Mapping sub-system
As explained in Section II, the motion planner uses signed distance values $\Phi$ in its optimization cost functions. The role of the mapping sub-system is to register LiDAR points from the onboard sensor, update the occupancy grid $\mathcal{G}$, and incrementally update the signed distance $\Phi$:
a) LiDAR registration: The laser at the bottom of the aircraft outputs roughly 300,000 points per second. We register the points in world coordinates using the rigid body transform between sensor and UAV, plus the UAV’s state estimation, which fuses GPS, barometer, internal IMUs and accelerometers.
b) Occupancy grid update: We use a fixed-size occupancy grid with square voxels that store an 8-bit integer occupancy probability, ranging from 0 (free) to 255 (occupied). All cells are initialized as unknown, at the midpoint of this range. Algorithm 1 covers the grid update process. The inputs to the algorithm are the sensor position, the LiDAR point, and a flag indicating whether the point is a hit or a miss. The endpoint voxel of a hit is updated with a positive log-odds value, and all cells between sensor and endpoint are updated by subtracting a log-odds value. We assume that all misses are returned as points at the maximum sensor range, in which case only the cells between endpoint and sensor are updated. Voxel state changes to occupied or free are stored in lists $O$ and $F$, which are used for the signed distance update.
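The update logic can be sketched as follows. The log-odds increments, the occupancy thresholds, and the sampling-based ray traversal are illustrative simplifications of Algorithm 1, not the constants used onboard:

```python
import math

# Simplified sketch of the occupancy update: voxels hold an 8-bit value
# (127 = unknown). A hit raises the endpoint voxel by L_HIT; every voxel
# between sensor and endpoint is lowered by L_MISS. Crossing a threshold
# appends the voxel to the changed-occupied / changed-free lists consumed
# by the incremental distance transform.

L_HIT, L_MISS = 60, 20            # illustrative log-odds increments
OCC_THRESH, FREE_THRESH = 200, 60 # illustrative state-change thresholds

def traverse(a, b, step=0.5):
    """Voxel indices visited along segment a->b (sampling approximation)."""
    n = max(1, int(math.dist(a, b) / step))
    seen, out = set(), []
    for i in range(n + 1):
        p = tuple(int(math.floor(a[k] + (b[k] - a[k]) * i / n)) for k in range(3))
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out

def update_ray(grid, sensor, endpoint, is_hit, changed_occ, changed_free):
    cells = traverse(sensor, endpoint)
    body = cells[:-1] if is_hit else cells   # misses also clear the endpoint
    for c in body:
        old = grid.get(c, 127)
        grid[c] = new = max(0, old - L_MISS)
        if old >= FREE_THRESH > new:
            changed_free.append(c)
    if is_hit:
        c = cells[-1]
        old = grid.get(c, 127)
        grid[c] = new = min(255, old + L_HIT)
        if old < OCC_THRESH <= new:
            changed_occ.append(c)
```

Repeated hits on the same voxel eventually push it over the occupied threshold, at which point it is reported exactly once to the distance-transform stage.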
c) Incremental distance transform update: We use the list of voxel state changes as input to an algorithm, modified from [cover2013sparse], that calculates an incremental truncated signed distance transform (iTSDT), stored in $\Phi$. The original algorithm described by [cover2013sparse] initializes all voxels in $\Phi$ as free, and as voxel changes arrive in the sets $O$ and $F$, it incrementally updates the distance of each free voxel to the closest occupied voxel using an efficient wavefront expansion technique, within some limit (hence truncated). Our problem, however, requires a signed version of the distance transform, where the inside and outside of obstacles are identified and given opposite signs. The concept of regions inside and outside obstacles cannot be captured by the original algorithm, which provides only an iTDT (no sign). Therefore, we introduce two important modifications:
i) Using obstacle borders. We define a border voxel as any voxel that is either a direct hit from the LiDAR, or any unknown voxel that neighbors a free voxel (Alg. 1). In other words, the border set represents all cells that separate the known free space from unknown space in the map, whether this unknown space lies inside an obstacle or is actually free but has not yet been cleared by the LiDAR. Differently from [cover2013sparse], our algorithm uses the updates $O$ and $F$ to maintain the distance of any voxel to its closest border, instead of the distance to the closest hit.
ii) Querying for the sign. We query the value of the occupancy grid $\mathcal{G}$ to attribute the sign of the iTSDT, marking free voxels as positive and unknown or occupied voxels as negative.
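The sign attribution thus reduces to a grid lookup at query time; the free-space threshold below is a hypothetical value matching the earlier occupancy sketch:

```python
def signed_distance(voxel, grid, dist_to_border, free_thresh=60):
    """Sign attribution for the iTSDT: positive in known free space, negative
    in unknown or occupied space (assumed potentially inside an obstacle).
    dist_to_border holds each voxel's unsigned distance to the border set."""
    occ = grid.get(voxel, 127)              # 127 = unknown, as initialized
    sign = 1.0 if occ < free_thresh else -1.0
    return sign * dist_to_border[voxel]
```

Keeping the sign out of the wavefront expansion means the expensive incremental update stays unchanged; only queries pay the (cheap) lookup cost.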
IV-C Planning sub-system
We want trajectories that are smooth, capture high-quality viewpoints, avoid occlusion and are safe. Given the real-time nature of our problem, we desire fast convergence to locally optimal solutions rather than global optimality obtained slowly. A popular approach is to cast the problem as an unconstrained optimization and apply covariant gradient descent [zucker2013chomp]. This is a quasi-Newton approach in which some of the objectives have analytic Hessians that are easy to invert and well-conditioned; hence such methods exhibit fast convergence while being stable and computationally inexpensive.
For this implementation, we use a waypoint parameterization of trajectories, i.e., $\xi_q = \{x_1, \dots, x_n\}$. The heading dimension is set so that the drone always points from $\xi_q(t)$ towards the actor $\xi_a(t)$. We design a set of differentiable cost functions as follows:
We measure smoothness as the cumulative derivatives of the trajectory. Let $D$ be a discrete difference operator. The smoothness cost is:

$J_{\text{smooth}}(\xi_q) = \frac{1}{2} \sum_{n=1}^{n_{\max}} \alpha_n \, \|D^n \xi_q\|^2,$

where $\alpha_n$ is a weight for each derivative order, and $n_{\max}$ is the number of orders. Note that smoothness is a quadratic objective whose Hessian is analytic.
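A sketch of this cost for a single coordinate of the waypoint trajectory; the order weights are illustrative, not the values used on the vehicle:

```python
def diff(seq):
    """Discrete difference operator D applied once."""
    return [b - a for a, b in zip(seq, seq[1:])]

def smoothness_cost(xs, weights=(0.0, 1.0, 0.5)):
    """Weighted sum of squared finite differences of increasing order:
    index 0 weights velocity, 1 acceleration, 2 jerk (weights illustrative)."""
    cost, d = 0.0, xs
    for w in weights:
        d = diff(d)                          # next derivative order
        cost += 0.5 * w * sum(v * v for v in d)
    return cost
```

A constant-velocity trajectory has zero cost under these weights, which is exactly the behavior we want from a smoothness term.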
Written in quadratic form, shot quality measures the average squared distance between $\xi_q$ and an ideal trajectory $\xi_{\text{shot}}$ that considers only positioning via cinematography parameters. $\xi_{\text{shot}}$ can be computed analytically: for each point of the actor motion prediction, the ideal drone position lies on a sphere centered at the actor, with radius defined by the shot scale, relative yaw angle and relative tilt angle (Fig. 5).
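The per-point viewpoint computation can be sketched as below, under this spherical parameterization (our naming; the actor is assumed to be on the ground plane):

```python
import math

def ideal_viewpoint(actor_xy, actor_heading, rho, rel_yaw, tilt):
    """Point on the sphere of radius rho around the actor, selected by the
    shot's scale (rho), relative yaw and relative tilt angles."""
    yaw = actor_heading + rel_yaw
    x = actor_xy[0] + rho * math.cos(tilt) * math.cos(yaw)
    y = actor_xy[1] + rho * math.cos(tilt) * math.sin(yaw)
    z = rho * math.sin(tilt)          # actor assumed at ground level
    return (x, y, z)

# Back shot: camera behind the actor (rel_yaw = pi), 5 m away, 30 deg above.
p = ideal_viewpoint((0.0, 0.0), 0.0, rho=5.0, rel_yaw=math.pi,
                    tilt=math.radians(30.0))
```

Stringing these points together over the actor forecast yields the ideal trajectory against which shot quality is measured.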
Given the online map $\mathcal{G}$, we calculate the TSDT $\Phi$ as described in Section IV-B. We adopt a cost function from [zucker2013chomp] that penalizes proximity to obstacles:

$c(d) = \begin{cases} -d + \frac{1}{2}\epsilon & d < 0 \\ \frac{1}{2\epsilon}(d - \epsilon)^2 & 0 \le d \le \epsilon \\ 0 & \text{otherwise} \end{cases}$
We can then define the safety cost function as the accumulated obstacle cost along the trajectory [zucker2013chomp].
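The obstacle penalty from [zucker2013chomp] takes the following piecewise form, with safety margin `eps`:

```python
def obstacle_cost(d, eps=1.0):
    """CHOMP-style penalty on the signed distance d to the nearest obstacle:
    large and linearly growing inside obstacles, quadratically decaying to
    zero beyond the safety margin eps."""
    if d < 0.0:
        return -d + 0.5 * eps            # inside obstacle or unknown space
    if d <= eps:
        return (d - eps) ** 2 / (2.0 * eps)  # within the safety margin
    return 0.0                           # far from obstacles
```

The function is continuous and differentiable at the branch boundaries, which is what allows it to be used inside gradient-based trajectory optimization.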
Even though the concept of occlusion is binary, i.e., we either do or do not have visibility of the actor, a major contribution of our past work [bonatti2018autonomous] was defining a differentiable cost that expresses a viewpoint’s occlusion intensity among arbitrary obstacle shapes. Mathematically, we define occlusion as the integral of the TSDT cost over a 2D manifold connecting both trajectories $\xi_q$ and $\xi_a$. The manifold is built by connecting each drone-actor position pair in time with a straight path.
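A discretized sketch of this integral, sampling straight sight lines between corresponding drone and actor positions (the manifold construction and TSDT interpolation are simplified; `tsdt_cost` stands in for the obstacle cost evaluated on the map):

```python
def occlusion_cost(drone_traj, actor_traj, tsdt_cost, n_samples=10):
    """Approximate the occlusion objective: average an obstacle cost over
    points sampled on each drone-to-actor sight line, summed over time.
    tsdt_cost maps a 3D point to its TSDT-based penalty."""
    total = 0.0
    for q, a in zip(drone_traj, actor_traj):
        for i in range(n_samples + 1):
            t = i / n_samples
            p = tuple(qk + t * (ak - qk) for qk, ak in zip(q, a))
            total += tsdt_cost(p) / (n_samples + 1)
    return total
```

Sight lines that pass near or through obstacles accumulate cost, so the optimizer is pushed toward viewpoints with a clear line of sight to the actor.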
Our objective is to minimize the total cost function (Eq. 1). We do so by covariant gradient descent, using the gradient of the cost function $\nabla J(\xi_q)$ and an analytic approximation of the Hessian $M$, with the update rule $\xi_q \leftarrow \xi_q - \frac{1}{\eta} M^{-1} \nabla J(\xi_q)$.
This step is repeated until convergence. We follow conventional stopping criteria for descent algorithms, and limit the maximum number of iterations. Note that we perform the matrix inversion only once, outside of the main optimization loop, which yields good convergence rates at low computational cost [bonatti2018autonomous]. We use the current trajectory as the initialization for the next planning problem.
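The loop structure can be sketched as follows, using a second-difference smoothness Hessian as the preconditioner; the step size, iteration limits, and the identity regularization are illustrative, not the values used on the vehicle:

```python
import numpy as np

def second_difference_matrix(n):
    """Finite-difference operator D^2 acting on n waypoints (one coordinate)."""
    A = np.zeros((n - 2, n))
    for i in range(n - 2):
        A[i, i:i + 3] = (1.0, -2.0, 1.0)
    return A

def covariant_descent(xi0, grad_fn, eta=2.0, max_iters=500, tol=1e-8):
    """xi <- xi - (1/eta) M^{-1} grad J(xi), with M inverted only once."""
    n = len(xi0)
    A = second_difference_matrix(n)
    M = A.T @ A + np.eye(n)       # identity keeps M invertible (illustrative)
    M_inv = np.linalg.inv(M)      # single inversion, outside the loop
    xi = np.array(xi0, dtype=float)
    for _ in range(max_iters):
        step = M_inv @ grad_fn(xi) / eta
        xi = xi - step
        if np.linalg.norm(step) < tol:    # conventional stopping criterion
            break
    return xi
```

Preconditioning with the smoothness Hessian propagates each local gradient correction over the whole trajectory, which is what gives covariant descent its fast, stable convergence relative to plain gradient descent.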
V-A Experimental setup
Our platform is a DJI M210 drone, shown in Figure 6. All processing is done on an NVIDIA Jetson TX2, with 8 GB of RAM and 6 CPU cores. An independently controlled DJI Zenmuse X4S camera gimbal records high-resolution images. Our laser sensor is a Velodyne Puck VLP-16 Lite, with a 30° vertical field of view and a 100 m maximum range.
V-B Field test results
Visual actor localization
We validate the precision of our pose and heading estimation modules in two experiments where the drone hovers and visually tracks the actor. First, the actor walks between two points over a straight line, and we compare the estimated and ground truth path lengths. Second, the actor walks on a circle at the center of a football field, and we verify the errors in estimated positioning and heading direction. Fig. 8 summarizes our findings.
Integrated field experiments
We test the real-time performance of our integrated system in several field test experiments. We use our algorithms in unknown and unstructured environments outdoors, following different types of shots and tracking different types of actors (people and bicycles) at both low and high speeds in unscripted scenes. Fig. 9 summarizes the most representative shots, and the supplementary video (https://youtu.be/ZE9MnCVmumc) shows the final footages along with visualizations of point clouds and of the online map.
We summarize runtime statistics in Table II, and discuss online mapping details in Fig. 7. While the vision networks take up a large share of the system’s RAM, CPU usage is fairly balanced across sub-systems.
Table II columns: thread usage (%), RAM (MB), and runtime (ms) per module.
V-C Performance comparison with full information
An important hypothesis behind our system is that we can operate with negligible loss in performance using noisy actor localization and a partially known map. We compare our system against three assumptions made in previous works:
Ground-truth obstacles vs. online map
We compare the average planning cost of a real-life test in which the planner operated while mapping the environment in real time against planning results for the same actor trajectory but with full prior knowledge of the map. Results are averaged over 140 s of flight and approximately 700 planning problems. Table III shows only a small increase in average planning cost, and Fig. 10a shows that the two trajectories differ minimally. The planning time, however, doubles in the online mapping case due to the extra CPU load.
Table III (excerpt):
Condition | Time (ms) | Avg. cost | Median cost
Noise in actor | 30.2 | 0.1276 | 0.0953
Ground-truth actor versus noisy estimate
We compare performance between simulated flights where the planner has full knowledge of the actor’s position and flights with artificially noisy estimates of 1 m amplitude. Results are also averaged over 140 s of flight and approximately 700 planning problems, and are displayed in Table III. The large cost difference is due to the shot quality cost, which relies on the actor’s position forecast and is severely penalized by the noise. If compared against the actor’s ground-truth trajectory, however, the difference in cost would be significantly smaller, as seen from the proximity of the two final trajectories in Fig. 10b. These results offer insight into the importance of our smoothness cost function when handling noisy visual actor localization.
Height map assumption vs. 3D map
As seen in Fig. 9c, our current system is capable of avoiding unstructured obstacles in 3D environments such as wires and poles. This capability is a significant improvement over our previous work [bonatti2018autonomous], which used a height map assumption.
VI Conclusion
We present a complete system for autonomous aerial cinematography that can localize and track actors in unknown and unstructured environments with onboard computing in real time. Our platform uses monocular visual input to localize the actor’s position, and a custom-trained network to estimate the actor’s heading direction. Additionally, it maps the world using a LiDAR and incrementally updates a signed distance map. Both of these feed a camera trajectory planner that produces smooth, artistic trajectories while avoiding obstacles and occlusions. We evaluate the system extensively in different real-world tasks with multiple shot types and varying terrains.
We are actively working on a number of directions based on lessons learned from field trials. Our current approach assumes a static environment. Even though our mapping can tolerate motion, a principled approach would track moving objects and forecast their motion. The TSDT is expensive to maintain because whenever unknown space is cleared, a large update must be computed. We are looking into a just-in-time update that processes only the subset of the map queried by the planner, which is often quite small.
Currently, we do not close the loop between the image captured and the model used by the planner. Identifying model errors, such as actor forecasting or camera calibration, in an online fashion is a challenging next step. The system may also lose the actor due to tracking failures or sudden course changes. An exploration behavior to reacquire the actor is essential for robustness.
We thank Mirko Gschwindt, Xiangwei Wang, and Greg Armstrong for their assistance in field experiments and robot construction.