Scaling Local Control to Large-Scale Topological Navigation
Visual topological navigation has been revitalized recently thanks to advances in deep learning that substantially improve robot perception. However, scalability and reliability remain challenging due to the complexity and ambiguity of real-world images and the mechanical constraints of real robots. We present an intuitive solution showing that, by accurately measuring the capability of a local controller, large-scale visual topological navigation can be achieved while remaining scalable and robust. Our approach achieves state-of-the-art results in trajectory following and planning in large-scale environments. It also generalizes well to real robots and new environments without retraining or finetuning. A video can be found at https://tinyurl.com/y6h2sok4.
I Introduction
There has been an emergence of cognitive approaches [savinov2018semi, wubayesian, kumar2018visual, blochliger2018topomap] towards navigation thanks to advances in deep learning that substantially improve the visual reasoning capability of robots. Compared to the traditional mapping, localization, and planning (SLAM) approach [thrun2005probabilistic, mur2015orb] that builds a metric map, cognitive navigation builds a topological map. This eliminates the need to meticulously reconstruct an environment, which requires expensive or bulky hardware such as a laser scanner or a high-resolution camera. Moreover, the fact that humans are able to navigate effortlessly in large-scale environments without a metric map is intriguing. By adding this cognitive spatial reasoning capability to robots, we could potentially lower the hardware cost (i.e., by using low-resolution cameras), make them work more robustly in dynamic environments, and bring insights to more complex tasks such as visual manipulation.
While cognitive navigation has drawn significant attention recently, the problem remains challenging because i) it does not scale well with the size of the experiences; ii) it is fragile in the face of actuation noise and dynamic obstacles; and iii) it lacks a probabilistic interpretation, making it difficult to plan under uncertainty. These problems are exacerbated when using an RGB camera in indoor environments, where partial observability makes it difficult to control a robot to follow a single path [kumar2018visual, hirose2019deep].
In this paper, we present a simple and intuitive solution for topological navigation. We show that by accurately measuring the capability of a local controller, robust visual topological navigation can be achieved with sparse experiences (Fig. 1). In our approach, we do not assume the availability of a global coordinate system or robot poses, nor do we assume noise-free actuation or a static environment. This minimalistic representation has only two components: a local controller and a reachability estimator. The controller is responsible for local reactive navigation, whereas the reachability estimator measures the capability of the controller for landmark selection and long-term probabilistic planning. To achieve this, we leverage the Riemannian Motion Policy (RMP) framework [rmp] for robust reactive control and deep learning for learning the capability of the controller from data. We show that with both components working in synergy, a robot can i) navigate robustly in the presence of nonholonomic constraints, actuation noise, and obstacles; ii) build a compact spatial memory through dynamic experience sparsification; and iii) plan probabilistically in the topological space, allowing it to generalize to new navigation tasks.
We evaluate our approach in the Gibson simulation environment [xia2018gibson] and on a real RC car. Our test environments contain a diverse set of real-world indoor scenes with strong symmetry and tight spaces. We show that our approach generalizes well to these unseen environments and, surprisingly, to real robots without finetuning. Scalability-wise, our spatial memory grows only when new experiences are unseen, making the system space-efficient and compute-efficient.
II Related Work
Cognitive spatial reasoning has been extensively studied both in neuroscience [o1978hippocampus, hafting2005microstructure, doeller2010evidence] and in robotics [thrun1998learning, milford2004ratslam, kuipers2000spatial]. The Spatial Semantic Hierarchy [kuipers2000spatial] divides the cognitive mapping process into four levels: control, causal, topological, and metric. In our method, the local controller operates on the control level, whereas the reachability estimator reasons about causal and topological relationships between observations. We omit metric-level reasoning since we are not concerned with building a metric map.
Experience-driven navigation constructs a topological map for localization and mapping [milford2004ratslam, maddern2012capping, fraundorfer2007topological, cummins2011appearance, blochliger2018topomap]. Unlike SLAM, which assumes a static environment, the experience graph can also be used to deal with long-term appearance changes [linegar2015work]. However, this line of work mostly focuses on appearance-based localization, ignores the control aspect of navigation, and assumes that a robot can always follow experiences robustly. This assumption does not usually hold in indoor environments and, as we will show later, it is crucial to design a good controller while accounting for its capability.
Semi-Parametric Topological Memory (SPTM) [savinov2018semi] is a recent work that brings deep learning to topological navigation. Similar to SPTM, we build a topological map from past experiences. Unlike SPTM, where experiences are stored densely, we select experiences using a strict probabilistic metric. This significantly improves the scalability and robustness of our approach.
There have been recent works studying visual trajectory following that handle obstacles [hirose2019deep, bansal2019combining], actuation noise [kumar2018visual], or use self-supervision [pathak2018zero]. Our approach differs from them in that our trajectory follower extends seamlessly to probabilistic planning. Our method also handles obstacles and actuation noise well, thanks to the RMP controller that models local geometry and vehicle dynamics.
Recent works on cognitive planning [gupta2017cognitive, gupta2017unifying] show that a neural planner can be learned from data. However, assumptions such as the availability of ground-truth camera poses and perfect self-localization are unrealistic. The use of a grid map also limits flexibility. Another line of research uses reinforcement learning to learn a latent map [mirowski2018learning, mirowski2016learning], but it is data-inefficient and cannot be easily applied to real robots. In contrast, our planner is general and can adapt to new environments quickly. It bears a resemblance to feedback motion planning systems such as LQR-Trees [tedrake2010lqr], where planning is performed on a map connecting reachable state spaces with feedback control, here in topological space with visual feedback.
III Approach
We consider the goal-directed navigation problem: a robot is asked to navigate to a goal g given an observation o_g taken at g. Since the environment is not fully observable, we assume the robot has collected a set of trajectories from varying sources (language instructions, self-exploration, etc.) as its experiences. Each trajectory is a dense sequence of observations recorded by its on-board sensor (e.g., a camera). The robot uses its experiences to decide the next action to take in order to reach g. The action space is continuous, and the robot may be affected by actuation noise and unseen obstacles.
We approach this problem from a cognitive perspective. By building a topological map from past experiences, the robot searches for a path, represented by a sequence of observations, which it can follow to reach g.
Since a robot only has access to past observations, a prerequisite for the robot to act is a Local Controller C. C predicts an action from the current observation o_t and a target observation o_target, i.e., a = C(o_t, o_target), to guide the robot towards the target. The action is executed for a small time step to obtain an updated o_t, and the process is repeated. C should be robust to disturbances such as actuation noise and moving obstacles. Hence, with C, a robot can navigate to a distant location by progressively following a planned trajectory.
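This closed loop can be sketched in a few lines. The sketch below is deliberately minimal: `controller`, `env`, and the fixed step budget per target are illustrative stand-ins, not the paper's actual interfaces.

```python
def follow(controller, env, targets, steps_per_target=50):
    """Drive toward each target observation in turn using local feedback.

    `controller(o_t, o_target)` returns an action; `env.step(a)` applies it
    for one small time step and returns the new observation. Both are
    hypothetical stand-ins for the components described in the text.
    """
    o_t = env.observe()
    for o_target in targets:
        for _ in range(steps_per_target):
            a = controller(o_t, o_target)   # reactive local control
            o_t = env.step(a)               # execute for a small time step
    return o_t
```

In the real system the loop would terminate on a reachability/proximity check rather than a fixed step count; the fixed budget here only keeps the sketch short.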
A practical issue is how to construct a scalable spatial memory. Using dense trajectories makes it difficult to scale to large environments due to storage and compute constraints. On the other hand, it is also undesirable to subsample a dense trajectory at a fixed time interval because, depending on the environmental geometry, an observation farther away may be confidently reachable (e.g., in a straight hallway), whereas a nearby observation can be heavily occluded (e.g., blocked by a corner). Furthermore, sensor field of view and resolution also affect a controller's confidence of reaching a target.
Our intuition is that the sparsification of a trajectory must respect the capability of the controller. We propose learning a Reachability Estimator R that predicts the probability of the controller C successfully reaching a target:

R(o_a, o_b) = Pr(C successfully drives the robot from o_a to o_b).
We use R as a probabilistic metric throughout the system for robust and scalable topological navigation. Fig. 2 illustrates how we build a navigation system with these two components. Given a controller C, we train a corresponding reachability estimator R. Incoming trajectories are first sparsified by R and then interlinked to form a compact topological map. Given a start and a goal, we leverage R to plan a probabilistic path, and we use C and R in synergy to follow the planned path robustly.
III-B Designing a Robust Local Controller
To enable the local controller to counter disturbances, we split C into two stages: high-level waypoint prediction and low-level reactive control. The high-level controller outputs a waypoint (in the robot's local coordinate system) for the low-level controller. The waypoint does not have to be precise; it only serves as a hint for the low-level controller to make progress. Hence the high-level controller is agnostic to robot dynamics and absorbs the effects of actuation noise. For the low-level controller, we adopt the RMP representation [meng2019neural] as a principled way to handle obstacle avoidance and vehicle control. C is thus the composition of the two stages. This two-level design allows the same high-level controller to be applied to different robots by replacing the low-level controller.
Fig. 3 illustrates the design of the high-level controller. The robot state is represented by its current observation o_t. While we could use the single observation o_i to represent the target state at index i of a sequence, this is undesirable because, if the robot deviates from the trajectory, the overlap between o_t and o_i might be too small for the robot to correct itself. Hence we incorporate the context around o_i using its neighboring observations

T_i = (o_{i-kd}, ..., o_{i-d}, o_i, o_{i+d}, ..., o_{i+kd}),

where k controls the context length and d is the gap between two consecutive context observations. The past frames expand the field of view at o_i, whereas the future frames encode the intention at o_i, allowing the controller to act in advance by adjusting the current waypoint (e.g., slowing down to enter a door).
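Building such a contextified target from a dense trajectory is straightforward to sketch. The helper name `contextify` and the boundary clamping below are our own illustrative choices, not the paper's implementation:

```python
def contextify(trajectory, i, k=2, d=3):
    """Build the contextified target around index i of a dense trajectory.

    Returns 2k+1 observations spaced d frames apart, centered at i and
    clamped at the trajectory boundaries. k (context length) and d (frame
    gap) mirror the symbols used in the text; the defaults are arbitrary.
    """
    n = len(trajectory)
    idx = [min(max(i + j * d, 0), n - 1) for j in range(-k, k + 1)]
    return [trajectory[t] for t in idx]
```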
Concretely, we extract a feature vector by feeding the stacked observations into a sequence of convolutions, then combine the feature vectors through one convolution and several fully connected layers to predict a waypoint. Additionally, the network predicts the heading difference between o_t and o_i to help it anchor the target image in the sequence. Finally, in order to reason about proximity to a target, the network also predicts mutual image overlap. Image overlap is the ratio of the content in one image that is visible in another image; hence mutual image overlap is a pair of ratios. For example, a pair in which the current image is fully contained in the target image but not vice versa suggests that the target is ahead of the robot, whereas a pair of (1.0, 1.0) means the robot is exactly at the target.
While our controller is trained in simulation, it generalizes well to real images because it is designed to explicitly reason about image correlation starting at the pixel level, and hence it is less prone to learning domain-specific information. Furthermore, our controller can accommodate inputs with varying modalities and fields of view due to its modular design and scalable context. We find it works well for both visual images and laser scans.
III-C Training the Reachability Estimator
Since R serves as a probabilistic metric used throughout the framework, it is critical to estimate its value accurately. Here we adopt a direct approach: training R by running rollouts using C. Start and target observations o_a and o_b are randomly sampled in a large number of environments, and C is used to drive the robot from o_a to o_b. Positive and negative pairs are generated by examining the outcomes, so that R can be trained in a supervised manner.
The design of R is similar to that of C, except that it produces a single probability. Also note that R uses the same contextified target representation.
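The rollout-based labeling procedure can be sketched as follows. `rollout(o_a, o_b)` is a hypothetical stand-in for running the controller in simulation and checking success, and the same-trajectory vs. cross-trajectory sampling mix mirrors the training setup described later in the experiments:

```python
import random

def make_reachability_dataset(trajectories, rollout, n_samples=1000, seed=0):
    """Label (start, target) observation pairs by rolling out the controller.

    `rollout(o_a, o_b)` drives the robot from o_a toward o_b and returns
    True on success; it stands in for the simulated rollouts in the text.
    Returns (o_a, o_b, label) triples for supervised training of R.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        if rng.random() < 0.6:          # both ends from the same trajectory
            traj = rng.choice(trajectories)
            o_a, o_b = rng.sample(traj, 2)
        else:                           # ends from two different trajectories
            t1, t2 = rng.sample(trajectories, 2)
            o_a, o_b = rng.choice(t1), rng.choice(t2)
        data.append((o_a, o_b, 1.0 if rollout(o_a, o_b) else 0.0))
    return data
```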
III-D Sparsifying a Trajectory
For any anchor o_i and a later observation o_j in a dense trajectory, if the reachability values R(o_i, o_k) for i < k ≤ j are all sufficiently high, we can confidently discard the observations between o_i and o_j, because the controller does not need them to reach o_j. Hence a greedy approach to choose the next anchor is

j* = max { j : R(o_i, o_j) ≥ p },

where o_i is the previous anchor and p is the probability threshold that controls sparsity. A dense trajectory is thus converted to a sequence of contextified anchor observations. One may argue that contextification reduces the effectiveness of sparsification. However, since the time and space complexity is a function of the number of anchors, in practice sparsification significantly reduces the computation required for planning and trajectory following, allowing our system to run on a robot in real time.
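A minimal sketch of this greedy sparsification, with `reach` standing in for the learned reachability estimator R:

```python
def sparsify(trajectory, reach, p=0.95):
    """Greedy anchor selection for a dense trajectory.

    From the current anchor at index i, keep skipping ahead while the
    estimated reachability reach(o_i, o_j) stays above the threshold p,
    then commit the farthest confidently reachable observation as the
    next anchor. `reach` stands in for the learned estimator R.
    """
    anchors = [0]
    i = 0
    while i < len(trajectory) - 1:
        j = i + 1
        # advance j as far as the controller can confidently reach from i
        while j + 1 < len(trajectory) and reach(trajectory[i], trajectory[j + 1]) >= p:
            j += 1
        anchors.append(j)
        i = j
    return [trajectory[a] for a in anchors]
```

Note that the immediate successor is always accepted even when its score is low, so the procedure never stalls on a dense trajectory.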
III-E Building a Compact Probabilistic Topological Map
Our topological map is a weighted directed graph (Fig. 8a). Vertices are anchor observations, and the weight of the edge from vertex u to vertex v is R(u, v). Construction is incremental: for an incoming trajectory, we create pairwise edges between every vertex in the graph and every anchor in the new trajectory.
Compared to a graph constructed from dense observations, a graph built from sparsified observations has less than 1/10 of the vertices and 1/100 of the edges. To further improve scalability, we propose two optimizations that make the graph grow sublinearly with the number of raw observations, so that the size of the graph eventually converges:
Edge pruning. Low-probability edges are discarded since they contribute little to successful navigation (Fig. 8b).
Vertex reuse. It is common for two trajectories to partially overlap, and storing the overlapping part repeatedly is unnecessary. Hence, when adding an anchor a to the graph, we check whether there exists a vertex v such that R(a, v) and R(v, a) are both above the reuse threshold. If so, we discard a and instead add edges from a's predecessor to v and from v to a's successor, as illustrated in Fig. 8c.
The graph will converge because, for any static environment of finite size, there is a maximum density of anchors. Any additional anchor will pass the vertex-reuse check and be discarded. In practice, however, an environment may change over time. The solution is to timestamp every observation and discard outdated observations. We leave the handling of long-term appearance change as future work.
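The incremental construction with both optimizations can be sketched as below. The dict-of-dicts graph representation, the threshold names, and `reach` (standing in for R) are illustrative choices:

```python
def add_trajectory(graph, anchors, reach, p_reuse=0.95, p_prune=0.05):
    """Incrementally add a sparsified trajectory to a topological map.

    `graph` maps vertex -> {neighbor: probability}; `reach` stands in for
    the learned reachability estimator R. Vertex reuse: an anchor that is
    mutually reachable with an existing vertex is merged into it. Edge
    pruning: edges below p_prune are dropped. Thresholds are illustrative.
    """
    placed = []
    for a in anchors:
        merged = a
        for v in graph:
            if reach(v, a) >= p_reuse and reach(a, v) >= p_reuse:
                merged = v              # reuse the existing vertex instead of a
                break
        graph.setdefault(merged, {})
        placed.append(merged)
    # pairwise edges between existing vertices and the new anchors,
    # keeping only edges above the pruning threshold
    for u in list(graph):
        for v in placed:
            if u != v:
                for s, t in ((u, v), (v, u)):
                    prob = reach(s, t)
                    if prob >= p_prune:
                        graph[s][t] = prob
    return graph
```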
III-F Planning a Path
To plan a path from the current observation o_s to the goal o_g, we add an edge (weighted by its negative log probability) from o_s to every vertex in the graph, and from every vertex in the graph to o_g. The weighted Dijkstra algorithm then computes the path with the lowest total negative log probability (i.e., the most likely path). The robot then decides whether the probability is high enough and, if so, runs the trajectory follower proposed in Section III-H.
What if there are multiple paths towards the destination, all with probability 1.0? One tweak is to clip the probability of hard edges (i.e., edges with probability 1.0) to a value slightly less than 1.0, so that each edge carries a small positive weight. In this case, the shortest path (in number of anchors) is selected.
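A sketch of this planner, with probability-1.0 edges clipped so that among equally certain paths the one with fewer anchors wins. The graph layout, the function name, and the eps value are illustrative assumptions:

```python
import heapq
import math

def most_likely_path(graph, start, goal, eps=1e-3):
    """Dijkstra over edge weights -log(p): shortest path = most probable path.

    `graph` maps vertex -> {neighbor: probability}. Probabilities are
    clipped to 1 - eps so hard edges still carry a small positive weight.
    Returns (path, probability), or (None, 0.0) if the goal is unreachable.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == goal:
            break
        for v, p in graph.get(u, {}).items():
            w = -math.log(min(p, 1.0 - eps))   # negative log probability
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))
    if goal not in dist:
        return None, 0.0
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], math.exp(-dist[goal])
```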
III-G Mitigating Perceptual Aliasing
In real-world environments, visually similar observations may correspond to different locations. Traditional approaches usually formulate this as a POMDP [thrun2005probabilistic] and try to resolve the ambiguity by maintaining beliefs over states. However, the requirement of a unique state (e.g., a global pose) associated with each observation does not apply here, since we do not maintain any metric information.
We propose two techniques to resolve ambiguity. The first is to match a sequence of anchors during search and graph construction. In practice, the probability of two segments having similar appearances is much lower than that of two single observations. Additionally, we perform online planning. The intuition is that even if the initial path is wrong, the robot can detect the discrepancy while following the path. Once a discrepancy is detected (e.g., the robot stays in the Dead reckoning state for too long; see Sec. III-H), it plans a new path using the current observation as the starting point. We find these two strategies largely resolve ambiguity. In the worst case, where distinctive anchors are absent, the robot might follow a cycle of anchors without making progress. The solution is to count how many times the robot has visited each anchor (i.e., by collecting statistics from the last visited anchor). Cyclic behavior can thus be detected, and the robot can break the loop by biasing its choices in future planning.
III-H Following a Trajectory
The trajectory follower constantly updates and tracks an active anchor to make progress, while performing dead reckoning to counter local disturbances. Specifically, given a sequence of anchor observations o_1, ..., o_n, the trajectory follower acts as a state machine:
Search: the robot searches for the anchor it is most likely to reach:

i* = argmax_i R(o_t, o_i).

If R(o_t, o_{i*}) ≥ p_follow, it sets o_{i*} as the current active anchor and enters the Follow state; otherwise it gives up and stops.
Follow: the robot computes the next waypoint C(o_t, T_i) for the active anchor and uses it to drive the low-level controller. Meanwhile, it tracks and updates the following two values:
Last visited anchor. The robot measures the proximity between o_t and the anchors close to the active anchor. The anchor with the highest proximity is set as the last visited anchor. This is a form of approximate localization.
Active anchor o_i. If the next anchor satisfies R(o_t, o_{i+1}) ≥ p_follow and o_i is within proximity, the robot advances the active anchor to o_{i+1}; otherwise it keeps o_i. The intuition is to choose an active anchor that is neither too close nor too far away.
Normally the robot stays in the Follow state. But if moving obstacles or actuation noise corrupt o_t such that R(o_t, o_i) < p_follow, it enters the Dead reckoning state.
Dead reckoning: the robot tracks the last waypoint computed in the Follow state and uses it to drive the low-level controller. The assumption is that disturbances are transient, so the robot can escape them by following the last successfully computed waypoint. Waypoint tracking can be done with an odometer and need not be very accurate. While in this state, the robot keeps trying two things:
Follow the active anchor o_i. If R(o_t, o_i) ≥ p_follow, the robot returns to the Follow state.
Search for a new active anchor. This is similar to the Search state but with an adaptive search radius around the last visited anchor.
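The state machine above can be condensed into a single per-tick transition function. This is a simplified sketch: the state tuple, the proximity bookkeeping (omitted here), and the threshold `p` are illustrative, with `reach` and `controller` standing in for the learned components:

```python
def follower_step(state, o_t, anchors, reach, controller, p=0.9):
    """One tick of the trajectory-following state machine.

    `state` is (mode, active_index, last_waypoint). Search picks the most
    reachable anchor; Follow computes a waypoint and advances the active
    anchor when the next one is confidently reachable; Dead reckoning
    reuses the last waypoint until reachability recovers. Returns the
    next state and the waypoint to execute (None when there is none).
    """
    mode, i, last_wp = state
    if mode == "search":
        scores = [reach(o_t, a) for a in anchors]
        best = max(range(len(anchors)), key=lambda j: scores[j])
        if scores[best] >= p:
            return ("follow", best, None), None
        return ("stop", None, None), None      # give up
    if mode == "follow":
        if reach(o_t, anchors[i]) < p:         # disturbance detected
            return ("dead_reckoning", i, last_wp), last_wp
        wp = controller(o_t, anchors[i])
        # advance the active anchor once the next one is confidently reachable
        if i + 1 < len(anchors) and reach(o_t, anchors[i + 1]) >= p:
            i += 1
        return ("follow", i, wp), wp
    if mode == "dead_reckoning":
        if reach(o_t, anchors[i]) >= p:        # recovered
            return ("follow", i, last_wp), last_wp
        return ("dead_reckoning", i, last_wp), last_wp
    return state, None
```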
IV Experiments
We trained C and R in 12 Gibson environments. 70k training trajectories were generated by running an A* planner (used to provide waypoints) with a laser RMP controller similar to [meng2019neural]. We use the laser RMP controller mostly for efficiency; in practice, an image-based RMP controller can also be used [meng2019neural]. C was trained by randomly sampling two images on the same trajectory with a certain visual overlap, with the waypoint as supervision. After C was trained, we trained R by sampling two images that either belong to the same trajectory (prob. 0.6) or to different trajectories (prob. 0.4) and running a rollout with C to obtain a binary outcome. We augmented the dataset by jittering the robot's starting location and orientation to improve generalization. About 1.5M samples were used to train C and R.
Our simulation models a real vehicle similar to [racecar], so that the same model can be used in real experiments. The on-board Jetson TX2 computer achieves an inference throughput of 100 Hz when running C and R, allowing real-time operation of our system. Due to the non-holonomic constraints of the vehicle, our setup is more challenging than that of most navigation systems, which use turtlebots.
We present quantitative results in 5 unseen Gibson environments. Our main baseline is SPTM, a recent work on deep visual topological navigation that is closely related to ours. Other works [kumar2018visual, hirose2019deep] lack planning capability and either require panoramic observations or assume discrete actions. Moreover, they cannot be easily extended to support sparse observations.
IV-A Trajectory Sparsification
Fig. 6 compares the sparsification results of three controllers. The two visual controllers differ in their context length (k = 2 and k = 5). To show that our model is general, we also trained a laser-based controller by modifying the input layer in Fig. 3 to take a 64-point 1D depth scan as input.
Fig. 6 shows the placement of anchors. Compared with the k = 5 controller, the k = 2 controller requires denser anchors. Since it uses a shorter context, it is more "local" and has to keep more anchors to follow a path robustly. Nonetheless, for both controllers anchors are more densely distributed in tight spaces and corners, indicating that our sparsification strategy adapts well to environmental geometry. Interestingly, the laser-based controller shows a more uniform distribution pattern. Since laser scans have a much wider field of view and measure geometry directly, the controller is not heavily affected by tight spaces or large viewpoint changes.
IV-B Trajectory Following
To test our trajectory follower, we randomly generated 500 trajectories in the test environments (Fig. 6) with an average length of 15 m. A trial is successful if the robot reaches the destination without collision using at most twice the number of time steps of the ground-truth trajectory.
Fig. 7 shows success rates under varying sparsity conditions. The first observation is that context matters. With context (Ours-k2, Ours-k5), the success rate reaches a stable high value, whereas using a single frame (Ours-k0) performs poorly. This also suggests that a weak controller cannot be saved by a dense observation sequence.
The second observation is that a more capable controller tolerates sparser anchors: Ours-k5 maintains its high success rate at higher sparsity levels than Ours-k2, which starts to degrade earlier. Nonetheless, by accurately measuring each controller's capability, both perform comparably under the same probability threshold.
Since the original SPTM uses dense observations, we compare ours to two variants of SPTM. SPTM-DS-* uses its retrieval network as the probabilistic metric while keeping everything else the same as ours. We see a clear and significant performance gap. The reason is that SPTM's retrieval network is trained with samples based on temporal gaps (20 steps as positive, 100 steps as negative). This is fragile because it is ignorant of the environmental geometry and the capability of the controller. We even notice an abnormal degradation when sparsification is disabled (sparsity from 0.8 to 1.0). In comparison, the open-loop controller selects the next anchor based purely on time progression, yet with a robust local controller it performs better than this variant of SPTM, indicating that a metric trained on temporal differences is unreliable. The second variant, SPTM-TS-*, uses temporal subsampling (keeping every n-th frame). SPTM-TS-k5 reaches a success rate close to ours, but it still suffers from abnormal degradation with increasing observation density. We suspect this result is largely attributable to the controller, because when using a weaker controller (SPTM-TS-k0) we see a sharp drop in success rate.
We also simulated noisy actuation by multiplying the change of velocity and steering angle by a random scaling factor for the Ours-k5 model. No noticeable difference was found. This is expected because the local controller runs at a high frequency (10 Hz) and uses visual feedback for closed-loop control.
While our primary focus is visual navigation, we also evaluated the laser-based controller (Ours-laser-k5). It performs comparably to our visual controllers, indicating that our approach is general and can be applied to non-visual sensors as well.
IV-C1 Navigation between Places
We built one topological map for each environment (Fig. 8a). A map is constructed from 90 trajectories connecting 10 locations in a pairwise fashion. The locations are selected to make the trajectories cover most of the reachable area.
A robot starts at one of the locations and is given a goal image taken at one of the other 9 locations. The robot has no prior knowledge of its initial location. To simulate real-world operation, we injected noise into the robot's initial location and orientation. 1000 trials were collected in each environment. Since SPTM uses dense observations and has only been tested in small synthetic mazes, its original parameters do not scale to real-world navigation tasks. We set the subsample ratio of SPTM to 10 to prevent its graph from getting too large. For our method, we use the same probability thresholds as in the previous experiments.
Table I presents the success rate in each environment compared to SPTM. SPTM performs poorly in real environments due to the large number of faulty shortcuts in its graph and its brittle metric, whereas ours maintains a consistently high success rate. There are two environments where our method exhibits a slightly lower success rate. Upon inspection, we found that the physical size of the robot and its motion constraints make it difficult to maneuver in some tight spaces, especially when it needs to plan a new path. Using a smaller robot with fewer motion constraints could mitigate this issue.
IV-C2 Comparing Trajectory Probabilities to Empirical Success Rate
To show that path probability is a reasonable indicator of the empirical outcome, we let a robot start at a random location (anywhere in a map), plan a path to one of the 10 destinations, and follow it. 1000 trajectories were collected in each environment. Fig. 10 shows that path probability corresponds strongly to empirical success rate. This allows a robot to assess the risk before executing a plan and to ask for help if necessary. Note that SPTM does not provide any uncertainty measure.
IV-C3 Generalizing to New Navigation Tasks
To test the generalizability of our planner, we randomly pruned the graphs to contain only a subset of the trajectories and repeated the experiment in Sec. IV-C1. Fig. 10 shows that with only 60% of the trajectories, the robot already performs close to its peak success rate. In other words, the robot is able to combine existing trajectories to solve novel navigation tasks. Fig. 12 shows an example.
IV-C4 Resolving Ambiguity
Fig. 12 illustrates how perceptual aliasing is resolved in environments with strong symmetry. The robot initially starts at an ambiguous location (marked “1”) and plans a wrong path (red path). While following this path, the robot detects the discrepancy at “2” by realizing what is expected to be an office room is actually a hallway. As a result, it plans a new path (green) whereby it successfully reaches the goal.
Fig. 13 shows that map sizes grow sublinearly with the number of raw observations, indicating the efficiency of our map-construction method. It also shows that map size adapts to the environment: since house75 exhibits more rendering artefacts, denser samples are kept to stay conservative.
IV-D Testing in a Real Environment
Our model trained in simulation generalizes well to real images without finetuning. To map a real environment, we manually drove the vehicle to collect 7 trajectories, totalling 3,200 images. The final map contains 206 vertices and 215 edges (Fig. 8b). The car is able to plan novel paths between locations and follow them while avoiding obstacles not seen during mapping (Fig. 1). We refer the interested reader to the supplementary video for more examples.
V Conclusion
In this work, we showed that by accurately measuring the capability of a local controller, large-scale visual topological navigation can be achieved in a scalable and robust manner. Due to its simplicity and flexibility, our framework also supports non-visual sensors and can be applied to other robotics problems. Future work includes combining multiple sensors (color, depth, laser) to improve the controller, developing better algorithms to resolve ambiguity, and applying our framework to manipulation tasks such as visual grasping.
This work was funded in part by ONR grant 63-6094 and by the Honda Curious Minded Sponsored Research Agreement. We thank NVIDIA for generously providing a DGX used for this research via the NVIDIA Robotics Lab and the UW NVIDIA AI Lab (NVAIL).