Towards Target-Driven Visual Navigation in Indoor Scenes via Generative Imitation Learning


We present a target-driven navigation system to improve mapless visual navigation in indoor scenes. Our method takes a multi-view observation of a robot and a target as inputs at each time step to provide a sequence of actions that move the robot to the target without relying on odometry or GPS at runtime. The system is learned by optimizing a combined objective encompassing three key designs. First, we propose that an agent conceives the next observation before making an action decision. This is achieved by learning a variational generative module from expert demonstrations. Second, we propose predicting static collisions in advance, as an auxiliary task to improve safety during navigation. Third, to alleviate the training data imbalance problem of termination action prediction, we introduce a target checking module instead of augmenting the navigation policy with a termination action. The three proposed designs all contribute to improved training data efficiency, static collision avoidance, and navigation generalization performance, resulting in a novel target-driven mapless navigation system. Through experiments on a TurtleBot, we provide evidence that our model can be integrated into a robotic system and navigate in the real world. Videos and models can be found in the supplementary material.

I Introduction

The last decade has seen significant achievements in the field of autonomous navigation technologies, starting with motion planning given a geometric model of the environment [1], then progressively integrating automation technologies into robots to assist navigating in explored scenes [2, 3, 4]. However, the autonomous mobility of robots is still limited in an unexplored scene with a new navigation task, which greatly limits the mobile robot application in many tasks, including household service and restaurant delivery.

Traditionally, robotic navigation methods consist of two parts. First, a geometric map is built using mapping techniques, such as Simultaneous Localization and Mapping (SLAM) [5]. Next, a collision-free path in workspace or configuration space is sought with respect to the map using path planning algorithms, such as Probabilistic Roadmaps (PRM) [6] and Rapidly Exploring Random Trees (RRT) [7]. However, these methods are highly sensitive to robot odometry and noise in sensor data. Representations constructed by SLAM systems are prone to error when the environment changes over time. Motion planning approaches often assume perfect localization and rely on high-quality geometric maps of the environment. More importantly, these methods ignore the rich information from the robots' on-board visual sensors, which limits their use for target-driven navigation (i.e., autonomously navigating to a semantic object).

Given the above limitations, deep learning-based mapless navigation approaches have gained considerable attention recently. These methods do not rely on prior knowledge of surroundings. They predict navigation actions directly from visual observations of robots based on end-to-end learning, including Imitation Learning (IL) [8, 9, 10] and Reinforcement Learning (RL) [11, 12, 13, 14]. IL based navigation requires the optimal demonstration from experts and has the advantage of fast learning of useful information [10]. RL based navigation does not specifically require supervision by an expert, as it searches for an optimal policy that finally leads to the highest reward. However, it generally requires abundant training data to converge, suffers from sparse rewards in navigation episodes, and struggles to generalize to unseen scenes with new targets.

Fig. 1: Target-driven navigation takes as input the current and the target observations and outputs an action that would lead to the target. We compare the following navigation models: (a) Baseline navigation policy; (b) Generative navigation policy; (c) Our integrated navigation pipeline.

In this paper, we focus on exploring supervised methods (in particular, imitation learning) to bring better cross-scene and cross-target generalization to target-driven visual navigation of robots. Given the sequence of observations and actions from a demonstration task, our navigation policy learns how to reach a target by imitating the expert demonstration step by step. One critical challenge in learning the navigation policy is that, in general, there may be multiple possible ways of going from the current location to the target: that is, the distribution of trajectories between states is multi-modal [15]. We address this issue with our novel variational generative module based on the idea of conceiving the next expected observation (NEO) before making an action decision. To operationalize this, we first learn a generative model conditioned on the multi-view observations at the current location as well as the target image, from which the NEO can be generated. We predict the next action based on the difference between the generated NEO and the current (front-view) observation. See Figure 1(b) for a schematic illustration of the framework. This framework has the effect of transferring the multi-modality of navigation action prediction to the generation of the NEO, making the process of action prediction a surjection rather than the learning of a multi-modal action distribution, thus greatly enhancing data efficiency. In addition, our NEO generation essentially models the forward dynamics of the agent-environment interaction, i.e., action-driven state transition. This enables multiple ground truth actions to take effect in the generation of the NEO, improving the expressiveness of our generation module, and thus facilitating the generalization performance.

In addition, we also incorporate recent insights relating network conditioning to navigation performance. Considering the learning efficiency and computation configuration, we explore different network architectures and finally present the architecture with a good tradeoff. To account for the static collision during robot navigation, we propose jointly optimizing the navigation policy with a premature collision prediction that evaluates the collision probability of all possible actions for every current position. Furthermore, we also design a target checking module for deciding whether the robot has reached the target. We will show that our method jointly trained with the module consistently outperforms the baseline, which augments the action space with a stop action.

In summary, our contributions are as follows: (1) We present a navigation pipeline (see Figure 1(c)) for navigating to novel targets in unexplored scenes using only the current visual observation and the target image, without relying on any location services at runtime. (2) We integrate a variational generative model into navigation policy learning, which strengthens the connection between robotic observation and navigation actions and helps alleviate the multi-modality in action decision making. (3) We propose adding a premature collision prediction module downstream of the convolutional neural network of our original architecture to provide a strong learning signal that encourages learning of useful features for both navigation tasks and static collision avoidance. (4) We design a target checking module in response to an optimization on our original architecture, which has the policy model output a stop action when a robot is close to a target position. Integrating with the module, our method demonstrates better navigation performance.

A preliminary version of the navigation model is presented in [16], which proposes a generative model for visual navigation. In this work, we have significantly extended the idea behind the model design by taking into account the multi-modality during navigation decision making, which is generally an important factor that affects the performance of navigation policy learning. In addition, we investigate three techniques to improve robot navigation performance in the real world, including feature space dynamics, premature collision prediction and additional target checking. We show that the proposed method significantly outperforms prior work [16], boosting the success rate from to and reducing approximately of the collisions for a navigation task in the unseen scenes from the Active Vision Dataset [17]. Furthermore, we steer a wheeled robot, TurtleBot, around office scenes and show that the learned navigation policy can generalize to novel targets in unseen real-world environments. Demonstration videos and the code can be found in the supplementary material.

The remainder of this paper is organized as follows. In Section II, we review the relevant background literature. Section III describes the target-driven visual navigation problem. In Section IV-A, we pose and solve the problem by integrating a variational generative model into navigation policy learning. Section IV-B presents three techniques to facilitate learning. Section V provides an extensive experimental validation of our designs. We conclude in Section VI with a summary and a discussion of future work.

II Related Work

The task of learning an agent (e.g. ground vehicle, UAV, or mobile robot) to physically navigate through an unknown environment has been approached either through reinforcement learning (RL) or imitation learning (IL). In this section, we provide a brief review of these related learning strategies for the sequential decision making problem of navigation.

Reinforcement Learning. Reinforcement learning has achieved state-of-the-art results in different fields by directly maximising cumulative reward without counting on expert supervision. Recently, a growing number of methods have been reported for RL-based navigation [18, 19, 20]. For example, Zhu et al. [3] propose an architecture for target-driven visual navigation by combining a Siamese network with an A3C algorithm. Ye et al. [21] focus on learning policies for robots to allow object searching and reaching. However, neither work considers the generalization to previously unseen environments. The work in [22] provides several additional RL learning strategies and associated architectures. Wu et al. [23] focus on room navigation, in which an agent learns to understand a given semantic room concept and finally navigate to the target room. The method shows strong results in some unseen environments of House3D. Wortsman et al. [18] propose a self-adaptive visual navigation (SAVN) method which learns to adapt to new environments on AI2-THOR without considering the generalization to novel targets. Furthermore, many recent works have extended deep RL methods to real-world robotics applications by either collecting an exhaustive real-world dataset of simple maze-like environments under grid world assumptions [24], or directly transferring a navigation model trained in simulation to real maze environments [25]. Anderson et al. [26] and Savva et al. [27] design RL-based agents for point navigation in realistic cluttered environments, which require an idealized GPS and the specific location of the goal at runtime. We also evaluate our target-driven navigation model on complex real-world scenes, each containing visually and structurally different observations, but without relying on any maps or localization devices.

Inverse Reinforcement Learning. Inverse RL has recently been the most commonly used method [28, 29]. The DAGGER model [30] proposes continuously closing the trajectory distributions from the agent and the expert demonstration and has been widely used for many robotic control tasks. To avoid directly interacting with the expert as in DAGGER, Ho et al. [31] design a generative adversarial model to fit distributions of states and actions defining expert behaviors. These methods demonstrate higher sample efficiency and generalization than many classical RL methods. Ziebart et al. [32] propose the maximum entropy inverse reinforcement learning (MaxEnt IRL), which is computationally efficient on a routing problem (mission planning). You et al. [33] learn the optimal driving strategy using inverse reinforcement learning based on the demonstrations from expert drivers, which demonstrates desired driving behaviors in some simulation environments. Xia et al. [34] focus on a specific navigation task and propose learning the underlying rewards from expert demonstrations under the framework of inverse RL. The navigation target information is hard-coded in the neural networks, which does not support the cross-target generalization. In visual navigation, Gupta et al. [35] present an end-to-end architecture based on DAGGER to jointly train mapping and planning for navigation in novel environments. One limitation of the work is the assumption of perfect odometry, which is not accessible in the real world. We propose a target-driven navigation system without relying on any topological maps or location measurements.

Imitation Learning. Imitation learning (IL) aims to mimic human behavior by learning from demonstrations [36, 37, 10]. Richter et al. [38] use a conventional feed-forward neural network to predict collisions for robotic navigation based on images observed by the robot, which relies on odometry and a preset goal location. Codevilla et al. [39] propose a framework that learns sub-policies using a multi-headed network in the autonomous driving setting. In [40], the authors propose a deep multi-task shared imitation learning framework, SMIL, that can learn to work on multiple robotics tasks with multiple sub-policies. Mousavian et al. [17] learn to predict the cost of an action, which is supervised by the shortest paths of navigation tasks. Pfeiffer et al. [41] leverage prior expert demonstrations for pre-training of a laser-based navigation policy. Pathak et al. [15] learn an inverse dynamics model based on the demonstrated trajectory way-points from the expert, which requires several intermediate sub-goals for a long-range navigation task. Watkins et al. [10] train an agent to navigate to any position via direct behavioral cloning from pre-generated expert trajectories, given a panoramic view of the goal and the current visual input. However, an environment map must be given when generalizing to unseen environments. In contrast to this work, we focus on both cross-target and cross-scene generalization for navigation, and we propose conceiving the next observation before acting, together with other optimization techniques, to obtain a more effective and generalizable navigation model.

III Problem Formulation

The goal of target-driven navigation is to learn a controller, which enables a robot to autonomously and safely navigate to a target in an unexplored scene, without providing any map, odometry, GPS or relative location of the target but only RGB or Depth input from on-board visual sensors. The navigation target is described by an image, which is also specified as an input to our model. Hence, for testing, a mobile robot with our model can navigate to new targets without re-training.

To achieve this, learning needs to be done through repeated interactions between an agent and an environment . At every time step , the agent receives an observation from and then performs an action within the available action space based on its current policy , where is the target. Subsequently, the agent transfers to a new observation within the observation space under the environment transition distribution . After repeating this process, the agent generates a trajectory , also named an episode. An episode can end when the agent acts for a certain number of time steps or reaches the target.

Fig. 2: (a) The robotics system setup. (b) Our navigation policy mainly consists of the four components in yellow squares. Symbols in solid circles denote the input and symbols in dotted circles represent the supervision from an expert to help update the parameters of the proposed policy.

We configure four RGB or Depth cameras to have a panoramic field of view at each time step. They are arranged at , , and horizontally, covering vertical fields of view (see Figure 2(a)). The target is consistent with the observation in terms of image data modality. We define a set of control commands: . The rotate ccw/cw actions turn the agent in place to the left/right, and the move action moves the agent a fixed distance (e.g., ). In our setting, an episode is terminated when the stop action is executed, or the maximum number of steps, , is reached. A successful episode means the agent issues the stop action exactly when it reaches the goal (the distance to the target is less than and the angle between the current and the target view directions is less than ) within steps. Our pipeline is fully automated and does not require human intervention in unknown scenes with new targets.
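The episode protocol above can be sketched as a short loop. This is only an illustrative sketch: the action names, the pose-based toy environment, and all numeric values (step length, turn angle, thresholds) are assumptions introduced here, and the toy policy reads the pose directly, whereas the actual policy only sees images.

```python
import math

# Hypothetical action names; the paper's exact command set is not reproduced here.
ACTIONS = ("rotate_ccw", "rotate_cw", "move", "stop")

def heading_gap_deg(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def reached(pose, goal, dist_thresh, angle_thresh):
    """Success test: close enough in position and in view direction."""
    dist = math.hypot(pose[0] - goal[0], pose[1] - goal[1])
    return dist < dist_thresh and heading_gap_deg(pose[2], goal[2]) < angle_thresh

def run_episode(policy, step_fn, start, goal, max_steps, dist_thresh, angle_thresh):
    """Roll out one episode; returns (success, number_of_decisions)."""
    pose, prev_action = start, None
    for t in range(1, max_steps + 1):
        action = policy(pose, goal, prev_action)
        if action == "stop":
            # The episode ends on stop; it succeeds only if the goal is reached.
            return reached(pose, goal, dist_thresh, angle_thresh), t
        pose = step_fn(pose, action)
        prev_action = action
    return False, max_steps  # step budget exhausted without issuing stop

def toy_step(pose, action, step_len=0.5, turn_deg=30.0):
    """Toy kinematics with assumed step length and turn angle."""
    x, y, h = pose
    if action == "move":
        rad = math.radians(h)
        return (x + step_len * math.cos(rad), y + step_len * math.sin(rad), h)
    if action == "rotate_ccw":
        return (x, y, (h + turn_deg) % 360.0)
    if action == "rotate_cw":
        return (x, y, (h - turn_deg) % 360.0)
    return pose

def toy_policy(pose, goal, prev_action):
    """Stand-in policy that cheats by reading poses instead of images."""
    near = math.hypot(pose[0] - goal[0], pose[1] - goal[1]) < 0.3
    return "stop" if near else "move"
```

With the toy functions, `run_episode(toy_policy, toy_step, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 10, 0.3, 45.0)` moves twice and then stops at the goal.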

IV Target-Driven Navigation Model

The goal of the target-driven navigation model is to generate action sequences that are as close as possible to what a human would have done in the same situation. In this work, we use a combination of a variational generative model and imitation learning to learn a reactive navigation policy, which is shown to outperform some state-of-the-art methods in the context of mapless navigation. In what follows, we describe how we learn the navigation policy and some additional techniques to improve its performance.

IV-A Navigation Policy

Let , be the sequence of observations and actions generated by the agent as it navigates to a target . The data is used to learn the target-driven navigation policy , which takes as input a pair of views and outputs an action required to approach the target observation from the current observation . We first present by a reactive deep network, which is trained by minimizing a cross-entropy loss as:


where is the ground-truth action distribution. To minimize the loss, it is common to assume as a delta function at a ground truth action. However, this assumption is notably violated, since there can be multiple ground truth actions for and is inherently multi-modal. When navigation trajectories are longer, more paths may take the agent from the current observation to the target observation, leading to a more difficult multi-modality issue. Previous works on imitation learning [10, 38] typically assume to be a delta function, which leads to high-variance in gradients during learning and in turn would make learning challenging. Recent deep reinforcement learning models [3, 23] require abundant samples to obtain a good empirical estimate of a multi-modal action distribution . We account for multi-modality by employing a variational generative process. Instead of learning the complex function from visual observations to action directly, we propose first learning to generate the next observation (NEO) from and then learning the mapping from to . The mapping is a surjection, which means there is only one appropriate action for . In this way, the multi-modality essentially affects the generation of , which is learned by a generative module as [16].

Generative Module. Given the current observation , we first model the environment transition dynamics as:


where is a parametric model of the joint distribution over the NEO and a latent variable . To learn the generative model , one typically maximizes the marginal log-likelihood . Since the next action is unknown a priori and is inherently determined by the target , we apply variational inference and introduce a distribution with parameters that approximates the true distribution . Then we obtain the marginal likelihood of the model:


To maximize the marginal likelihood, we maximize its lower bound:


where denotes the Kullback-Leibler divergence between two distributions, and . We design a generative module for maximizing the lower bound, in which , , and are all parameterized by neural networks.
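For reference, a lower bound of this kind has the standard conditional variational form sketched below. The notation is an assumption introduced only for this sketch: $x$ stands for the NEO, $c$ for whatever inputs the model is conditioned on, $z$ for the latent variable, $p_\theta$ for the generative model and its prior, and $q_\phi$ for the approximate posterior.

```latex
\log p_\theta(x \mid c)
  = \log \int p_\theta(x, z \mid c)\, dz
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid c)}\!\left[ \log p_\theta(x \mid z, c) \right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z \mid c) \,\|\, p_\theta(z \mid c) \right)
```

The first term rewards accurate reconstruction of the next observation, and the second keeps the approximate posterior close to the prior.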

During training, the navigation tasks to be imitated are provided with a series of ground truth trajectories, e.g., , which are computed using Dijkstra's algorithm. Therefore, can be estimated as a Gaussian distribution conditioned on the current observation and the ground-truth action , leading to a mixture-of-posteriors prior imposed on the latent distribution for the multi-modality in the generation of the next observation. By minimizing the divergence, the two distributions, and , are pulled close to each other, which drives the generation of the next observation, , to favour the navigation task while remaining consistent with the environment transition dynamics. In addition, we empirically approximate using samples , that are obtained by the agent after executing at . From the above, the loss for our generative module is:
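A loss of this shape, a reconstruction term plus a KL term between two Gaussians, can be sketched as follows. This is a minimal sketch under stated assumptions: both distributions are taken to be diagonal Gaussians parameterized by a mean and a log-variance, the reconstruction term is taken to be a mean squared error, and the direction of the KL term and all function names are choices made here for illustration, not details confirmed by the text.

```python
import math
import random

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) for two diagonal Gaussians (direction assumed)."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def mse(pred, target):
    """Mean squared error, used here as an assumed reconstruction term."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def generative_loss(pred_next, true_next, mu_q, logvar_q, mu_p, logvar_p):
    """Unweighted sum of the reconstruction and KL terms."""
    return mse(pred_next, true_next) + kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

In practice the sample from `reparameterize` would be fed to a decoder network that produces `pred_next`; the decoder is omitted here.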


Predictive Control. Further, to realize robot navigation, we learn a navigation action controller , which predicts the next best action based on the current observation , the generated next expected observation as well as the previous action . Note that inputting the previous action to our model at each time step could be promising when an agent runs back and forth in a scene. Given the ground truth action , the controller is trained by minimizing the standard cross-entropy loss as:


Integrating the predictive control with the generative module (see Figure 2), the objective of our navigation policy becomes:


where the hyper-parameter tunes the relative importance of the reconstruction term, the KL divergence term and the predictive control term. The three hyper-parameters are empirically set as , , and throughout our experiments.

IV-B Techniques to Facilitate the Navigation

We also investigate three techniques to improve robot navigation performance in the real world. First, we learn the environment dynamics in the feature space as opposed to the raw observation space. Furthermore, we propose a premature collision prediction module to improve the safety during navigation. Finally, a target checking module is also designed to issue the stop when the agent is near the target.

Feature Space Dynamics. [42] and [43] have proposed improving the generalization of learning models by learning forward dynamics in the feature space instead of raw observation space. Following this, we extend our navigation model to make predictions in feature representations of raw observations. We apply a CNN module to derive a feature representation from an observation and hence get the current feature , the ground truth next state feature , and the target feature . We have conducted some experiments on evaluating the choice of the CNN module, e.g., the sophisticated convolutional layer in ResNet [44] and VGG [45] (see Section V-B). Considering both the efficiency and the navigation performance, we design our CNN module in Figure 3(a). The module can compress an RGB or Depth image into a -D feature space. Spectral normalization is used for the first four convolutional layers, which can prevent the escalation of parameter magnitudes and avoid unusual gradients in training [46]. The activation function used is LeakyReLU .
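To make the role of spectral normalization concrete, the sketch below estimates the largest singular value of a weight matrix by power iteration and rescales the matrix by it, which is the usual way this normalization is computed. It is a plain-Python illustration, not the CNN module itself: a convolutional kernel is assumed to have been reshaped into a 2-D matrix beforehand, and the LeakyReLU slope is left as a parameter because its value is not given here.

```python
import math

def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spectral_norm(w, iters=50):
    """Estimate the largest singular value of w by power iteration (iters >= 1)."""
    wt = transpose(w)
    v = normalize([1.0] * len(w[0]))
    for _ in range(iters):
        u = normalize(matvec(w, v))
        v = normalize(matvec(wt, u))
    return sum(a * b for a, b in zip(u, matvec(w, v)))

def spectrally_normalize(w, iters=50):
    """Divide w by its spectral norm; assumes w is not the zero matrix."""
    sigma = spectral_norm(w, iters)
    return [[x / sigma for x in row] for row in w]

def leaky_relu(x, slope):
    """LeakyReLU with a caller-supplied negative slope."""
    return x if x >= 0 else slope * x
```

After normalization the largest singular value of the weight matrix is approximately one, which bounds how much a layer can amplify its input.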

In addition, we directly use the feature after two fully connected (FC) layers of , denoted as , to help the predictive navigation control and update the reconstruction term in Equation 7. The final objective with feature space dynamics is as follows:


Premature Collision Prediction. We propose incorporating an auxiliary module into our navigation policy in order to promote more robust learning, and ultimately safer navigation performance for our agent. This auxiliary module is a multilayer perceptron (MLP) downstream of the CNN module of our navigation policy, which provides the collision probability of all actions in over the current four-view observation (see Figure 3(b)). We refer to this as the premature collision prediction module, , leading to a multi-label classification loss term, which is specified as follows:


where is a delta function at , which is provided by the interaction between the agent and the environment.

Given the state representation of the current observation, , the auxiliary module can be summarized as predicting the collision probability of all possible actions at this time step. The module shares the same CNN module as the navigation policy network. We believe this forces the CNN module to learn low-level representations that are useful for both the navigation task and collision avoidance.
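A multi-label classification loss of the kind described above can be sketched as an independent binary cross-entropy per action. The sketch assumes the module emits one logit per action and that the labels are 0/1 collision indicators obtained from the environment; the function names and the averaging over actions are assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def multilabel_bce(logits, labels, eps=1e-7):
    """Mean binary cross-entropy over actions.

    logits: one raw score per action (higher means collision is more likely).
    labels: 1 if executing that action collides, else 0.
    """
    total = 0.0
    for x, y in zip(logits, labels):
        p = min(max(sigmoid(x), eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)
```

Because each action gets its own sigmoid, several actions can be flagged as colliding at the same position, which a softmax over actions could not express.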

Fig. 3: The architectures of our (a) CNN module, (b) premature collision prediction module, (c) target checking module.

Target Checking. The target checking module is especially critical for robot navigation in the real world, which enables a robot to figure out if the current target is reached. This process is simple given knowledge of the true physical state, but difficult when working with visual observations. Aside from the usual challenges of visual recognition, the significant training data imbalance further complicates the target checking task, since we only have one positive example of stop action at the end of each trajectory, while all the other steps are negative examples for not stop.

We pose the target checking as a binary classification problem and design a target checking module that takes in the current four-view observation concatenated with the target image and predicts whether the agent has reached the target position, denoted as . The target checking module is jointly trained with our navigation policy. It shares the CNN and the feature fusion parts with our navigation policy, which provides a -D fused feature vector at each time step. The vector finally passes through an MLP to output the probability of the target being reached. A detailed topology of the module is pictured in Figure 3(c). Similar to [16], to alleviate the effect of data imbalance, we guarantee that approximately training tasks are near navigation targets, for which the optimal action at the start location is stop. We train the target checking module using a binary cross-entropy loss defined as:


where is a delta function at , which is supervised by the environment.
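The rebalancing idea above, ensuring that a fixed share of training tasks start near their targets, can be sketched as a simple sampler. The share itself is left as the parameter `near_fraction` because its value is not given here; the pair lists and the function name are likewise hypothetical.

```python
import random

def sample_training_tasks(far_pairs, near_pairs, num_tasks, near_fraction, rng):
    """Draw (start, target, reached_label) tasks with a controlled share of positives.

    near_pairs: (start, target) pairs whose start already satisfies the goal test,
                so the optimal first action is stop (label 1).
    far_pairs:  pairs that require navigation before stopping (label 0).
    near_fraction: assumed share of near-target tasks, in [0, 1].
    """
    tasks = []
    for _ in range(num_tasks):
        if rng.random() < near_fraction:
            start, target = rng.choice(near_pairs)
            tasks.append((start, target, 1))
        else:
            start, target = rng.choice(far_pairs)
            tasks.append((start, target, 0))
    return tasks
```

Sampling with a controlled share gives the binary classifier many more positive examples than the single terminal step of each trajectory would provide.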

Subsequently, we introduce and to integrate both the premature collision prediction module and the target checking module into our navigation policy learning. The two weights control the strength of the two auxiliary loss terms. We experiment with several different constants and are finally determined. The goal of automatically computing the optimal weights for an arbitrary environment is a good topic for future work. Our overall navigation objective is given:


At test time, our model outputs a navigation command given the current observation, the target view and the previous action progressively. This will drive the robot toward the eventual target while avoiding some static obstacles in unseen scenes, without relying on any maps or location services.
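How the individual loss terms combine into one training objective can be sketched as a weighted sum. The weight names and the example values below are placeholders chosen for illustration; they are not the values used in the experiments.

```python
def navigation_objective(recon, kl, control, collision, target_check, weights):
    """Weighted sum of the policy terms and the two auxiliary terms.

    weights: dict with keys 'recon', 'kl', 'control', 'collision', 'target'
             (hypothetical names for the five coefficients).
    """
    return (weights["recon"] * recon
            + weights["kl"] * kl
            + weights["control"] * control
            + weights["collision"] * collision
            + weights["target"] * target_check)

# Placeholder weights, for illustration only.
example_weights = {"recon": 1.0, "kl": 0.5, "control": 2.0, "collision": 0.25, "target": 0.25}
example_total = navigation_objective(2.0, 4.0, 1.0, 8.0, 4.0, example_weights)
```

Setting an auxiliary weight to zero recovers the objective without that module, which is one way to express the ablated variants in a single training script.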

V Experiments and Discussions

We evaluate our model by testing on both synthetic and real-world D navigation tasks. We present our navigation performances in the context of the different choices we made in our design, as well as comparing with some alternate methods. The key characteristic of a good navigation policy is that it generalizes to unseen scenes and new targets while remaining robust to irrelevant parts of visual observations.

V-A Experimental Setup

Alternatives. We compare the navigation performance to the following alternative models: (1) NeoNav is our previous work [16]. (2) TD-A3C is the baseline of [3]. To evaluate the generalization to different targets in unknown scenes, we keep only one scene-specific layer of the network and train on all training scenes. (3) TD-A3C-IL incorporates IL into TD-A3C to address the sample inefficiency of RL. Furthermore, for a fair comparison, the variant is equipped with CPM and TCM, and uses the same CNN module and input as ours. (4) G-LSTM-A3C-IL is a variant of TD-A3C-IL using more advanced architectures with LSTM from [23]. (5) GSP is a goal-conditioned skill policy from [15], which learns an inverse dynamics model based on some demonstrated way-points from an expert and predicts the next state feature as an auxiliary task for control. We reimplement the work based on the authors' provided code and train it in our setup for a fair comparison. (6) LSTM-A3C-KG-A [47] uses a knowledge graph and an attention mechanism to support spatial reasoning and guide policy search.

Implementation Details. Our model is trained and tested on a PC with Intel(R) Xeon(R) W- CPU, GHz and a Geforce GTX Ti GPU. We use an RMSprop optimizer [48] to update our model with a learning rate of and a smoothing constant of . During training, each update is based on time steps from random trajectories, each of which is generated by randomly selecting a scene, a start and a target from our training split.

Evaluation Metrics. When sampling evaluation tasks, we consider the ratio of the shortest path distance to the Euclidean distance between the start and goal positions of a task, proposed by [27] to benchmark navigation task difficulty. In each evaluation, we compute the percentage $P$ (lower is more challenging) of the tasks that have a ratio within the range of . In addition, we adopt two main evaluation metrics in our experiments: success rate (SR) and success weighted by path length (SPL) [49]. For each of the $N$ navigation evaluation tasks, let $S_i$ be a binary indicator of successful ($S_i = 1$) or unsuccessful ($S_i = 0$) navigation, and let $l_i$ and $p_i$ denote the length of the shortest path and of the actual executed path of the $i$-th task, respectively. Success rate is the fraction of tasks in which the agent reaches the target successfully within the limited number of time steps: $\mathrm{SR} = \frac{1}{N}\sum_{i=1}^{N} S_i$. SPL considers both the success indicator and the length of the executed path: $\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \frac{l_i}{\max(p_i, l_i)}$. The higher this value, the more efficiently, on average, the agent reaches the target.
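The two metrics can be computed as in the following sketch, which follows the standard definition of SPL from [49]. The argument names are chosen here for illustration: `successes` holds one 0/1 flag per task, `shortest` the shortest-path lengths, and `executed` the lengths of the paths actually taken.

```python
def success_rate(successes):
    """Fraction of evaluation tasks that ended in success."""
    return sum(successes) / len(successes)

def spl(successes, shortest, executed):
    """Success weighted by path length, averaged over all tasks."""
    total = 0.0
    for s, l, p in zip(successes, shortest, executed):
        if not s:
            continue  # failed tasks contribute zero
        denom = max(p, l)
        # A task that starts at the goal has zero path length; count it as optimal.
        total += 1.0 if denom == 0 else l / denom
    return total / len(successes)
```

For three tasks with success flags `[1, 0, 1]`, shortest-path lengths `[2, 3, 4]`, and executed lengths `[4, 5, 4]`, SR is 2/3 and SPL is 0.5: the first success took twice the optimal distance and the second was optimal.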

V-B D Navigation in AVD

We first conduct our experiments based on the training and testing splits of AVD [17]. The input visual resolution is . For each training scene, we choose fifteen different views as the targets by default, each of which contains a common object, such as a dining table, a sofa, a television, etc. The start position of a navigation agent can be randomly sampled across the scene. The training times of our model, NeoNav, TD-A3C-IL, G-LSTM-A3C-IL, and GSP are all about hours for RGB or depth input. TD-A3C requires double the training time. For evaluation, two kinds of settings are considered here, and . Each evaluation contains different navigation tasks (). The target views of these evaluation tasks, which are different from the training target views, are randomly sampled.

Ablations. We first evaluate how the performance is affected by changing the input modality of our model. We train our model with RGB inputs which leads to the SR/SPL (in ) of and the model with RGBD inputs leads to the SR/SPL of on the evaluation from unseen environments. Compared to RGB, depth images contain rich geometry information which benefits a powerful reasoning about the surrounding layout, leading to better navigation policies (). However, learning the effective combination of features from RGB and depth images may further complicate the navigation decision making, resulting in worse performance. Hence, unless explicitly stated otherwise, we use depth by default.

An ablation study on the CNN backbone is provided based on the tasks from unseen environments. The navigation performance (SR and SPL in %) of our re-implementations of VGG and ResNet are: and , respectively. These are similar to Ours () with the CNN design in Section IV-B. However, the training time of VGG is at least twice that of our algorithm and the training of ResNet is three times longer than training our current architecture. Hence, we suggest our design considering both the navigation performance and the training time.

Our navigation policy combines a variational generative model with imitation learning and is augmented with two auxiliary modules: premature collision prediction and target checking. We systematically ablate these components to quantify the importance of each: (1) Ours-NoVG removes the variational generative module and predicts navigation actions directly based on the current observation and the target. (2) Ours-NoCP predicts navigation actions without prematurely considering a collision at each step. (3) Ours-NoTC learns to output a stop action by the policy rather than a target checking module.

As shown in Table I, our navigation pipeline reduces average collisions (48.1 vs. 57.5) and improves average SR (28.7% vs. 17.5%) and average SPL (8.8% vs. 4.4%) over our prior navigation algorithm (NeoNav) [16], which demonstrates the effectiveness of our novel designs in unseen scenes. Ours-NoVG ignores the multi-modality of navigation by directly learning the complex mapping from visual observations to actions, which is difficult to generalize to new navigation tasks. Ours-NoCP demonstrates worse static collision avoidance during navigation (56.2 average collisions). The comparison between Ours-NoTC and Ours shows that the training data imbalance problem significantly affects robot navigation learning and can be alleviated by a separate target checking module. We also visualize navigation trajectories of these models in Figure 4. None of the ablated models reaches both targets. In contrast, our proposed agent performs best in terms of both path quality and navigation success.
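The gaps between the ablated variants can be read off Table I directly. The short sketch below only redoes that arithmetic on the values copied from Table I; the helper name `deltas` is ours for illustration and is not part of any released code.

```python
# (SR %, SPL %, average collisions per task), copied from Table I.
table_i = {
    "NeoNav": (17.5, 4.4, 57.5),
    "Ours-NoVG": (22.6, 8.4, 50.1),
    "Ours-NoCP": (27.1, 8.4, 56.2),
    "Ours-NoTC": (19.8, 6.3, 51.3),
    "Ours": (28.7, 8.8, 48.1),
}


def deltas(model, reference="NeoNav"):
    """Absolute differences (model minus reference) for SR, SPL, and collisions."""
    m, r = table_i[model], table_i[reference]
    return tuple(round(a - b, 1) for a, b in zip(m, r))


for name in table_i:
    if name != "NeoNav":
        d_sr, d_spl, d_col = deltas(name)
        print(f"{name:10s} SR {d_sr:+.1f}  SPL {d_spl:+.1f}  collisions {d_col:+.1f}")
```

Passing a different `reference`, for example `deltas("Ours", "Ours-NoTC")`, isolates the contribution of a single module in the same way.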

Environment                     Model         SR     SPL    Collisions
Unseen Environments, P=15.0%    NeoNav [16]   17.5   4.4    57.5
                                Ours-NoVG     22.6   8.4    50.1
                                Ours-NoCP     27.1   8.4    56.2
                                Ours-NoTC     19.8   6.3    51.3
                                Ours          28.7   8.8    48.1
TABLE I: Ablation study on model structure based on the average navigation performance (SR and SPL in %, average collisions for a task) on AVD with depth input.
Fig. 4: Visual comparison of navigation paths. Blue dots represent the reachable positions in the scenes. Black stars and red stars denote starting and goal points, respectively. Ours, NeoNav, Ours-NoVG, Ours-NoCP, and Ours-NoTC choose the black, the yellow, the magenta, the cyan and the green paths, respectively. Triangles in different colors represent end points of different models. Some triangles can be overlapped with others (e.g., the black and the yellow triangles both overlap with the cyan triangle in the second task). Only our agent successfully navigates to both the targets.
Environment           Seen           Unseen        Unseen         Unseen
Target Views          120            120           240            360
Random                1.4 / 0.8      2.8 / 1.8     2.8 / 1.8      2.8 / 1.8
TD-A3C [3]            26.3 / 7.2     6.4 / 3.4     7.9 / 4.1      8.1 / 4.0
TD-A3C-IL [3]         47.2 / 20.1    20.7 / 6.7    23.4 / 7.1     26.5 / 8.6
G-LSTM-A3C-IL [23]    39.6 / 11.2    20.3 / 6.1    22.3 / 6.3     27.2 / 7.4
GSP [15]              52.3 / 25.3    23.4 / 6.7    25.1 / 9.7     31.8 / 10.7
Ours                  49.3 / 23.6    28.7 / 8.8    31.6 / 10.3    33.3 / 12.0
TABLE II: Navigation performance (SR / SPL in %) for different numbers of training target views from AVD with depth input.

Comparisons. Table II summarizes the results of our proposed model and the alternatives. All learned models perform better on seen scenes than on unseen scenes, and the performance of both the baselines and our model degrades drastically in unseen scenes. This indicates that none of the models acquires a deep understanding of navigation tasks and environments. We conjecture that this is because the accessible scenes are limited and highly discretized during training, which impedes the understanding of real indoor environments characterized by high complexity and continuity. We also evaluate the navigation performance of these models when trained with RGB inputs (see Table III). We find that depth information consistently improves the navigation performance of all models in unseen environments.

                 Seen            Unseen
Model            SR      SPL     SR      SPL
Random           1.4     0.8     2.8     1.8
TD-A3C           20.7    5.1     5.1     2.9
TD-A3C-IL        45.2    19.7    18.2    5.9
G-LSTM-A3C-IL    31.7    9.5     17.9    5.5
GSP              47.9    14.3    19.3    5.5
Ours             54.6    23.5    21.3    6.9
TABLE III: Average navigation performance (SR and SPL in %) comparisons on AVD with RGB inputs.

In addition, our proposed navigation pipeline generally outperforms these methods in terms of both path quality and success rate. TD-A3C is originally designed for scene-specific policy learning and thus lacks the ability to generalize to unseen scenes. Moreover, dealing with sparse rewards is challenging in RL. TD-A3C-IL, which combines TD-A3C with IL and the proposed CPM and TCM, achieves significantly higher performance than the pure A3C method. This indicates that imitation learning and the proposed modules substantially accelerate the learning of navigation agents. G-LSTM-A3C-IL, a direct application of LSTM to TD-A3C-IL, does not yield a clear improvement because of the limited training data. Moreover, TD-A3C-IL and G-LSTM-A3C-IL both learn the complex function directly from visual observations to navigation actions, which is difficult because the multi-modality of navigation actions weakens the correlation between visual observations and actions. GSP addresses the multi-modality with its forward consistency, which treats the GSP-predicted action as consistent with the ground-truth action when both lead to a next state that benefits the navigation task. In contrast, our method learns to imagine the next observation from the current observation and the target, and then learns the mapping from the difference between the imagined and the current observations to the navigation action. This transfers the multi-modality to the generation of NEO, which is handled by a variational generative module, and keeps the navigation action prediction a surjection, which guarantees a strong correlation. Figure 5 shows the agent trajectories produced by these models for two navigation tasks from unseen scenes. Only our model successfully navigates the agent to the targets in these two cases.

Fig. 5: Visual comparison of navigation paths. Blue dots represent the reachable positions in the scenes. Black stars and red stars denote starting and goal points, respectively. Triangles in different colors represent end points of different models. TD-A3C, TD-A3C-IL, G-LSTM-A3C-IL, and GSP choose the magenta, the green, the cyan and the yellow paths, respectively. Our agent takes the black paths and is able to successfully navigate to the goals in the two cases.

We also evaluate how the navigation performance of these models improves when they are trained on increasing numbers of target views from the training split (Table II). The evaluation is based on the navigation tasks from the unseen environments. As can be seen, all the learned models show increasing SRs with increasing numbers of training target views, and their SPLs follow the same overall trend. GSP presents a faster rate of growth than the others, indicating that having more targets is advantageous for improving the learning capability of agents. Furthermore, our model achieves the best results for every number of training target views, which indicates the data efficiency of the whole architecture.
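The growth trend can be checked against the unseen-environment SR columns of Table II. The sketch below is only a worked reading of those published numbers; the variable and function names are illustrative.

```python
# Unseen-environment SR (%) from Table II for 120, 240, and 360 training target views.
views = (120, 240, 360)
unseen_sr = {
    "TD-A3C": (6.4, 7.9, 8.1),
    "TD-A3C-IL": (20.7, 23.4, 26.5),
    "G-LSTM-A3C-IL": (20.3, 22.3, 27.2),
    "GSP": (23.4, 25.1, 31.8),
    "Ours": (28.7, 31.6, 33.3),
}


def sr_gain(model):
    """SR gain (percentage points) from the fewest to the most target views."""
    sr = unseen_sr[model]
    return round(sr[-1] - sr[0], 1)


gains = {model: sr_gain(model) for model in unseen_sr}
fastest = max(gains, key=gains.get)

# Best model for each number of training target views.
best_at_each = [
    max(unseen_sr, key=lambda m: unseen_sr[m][i]) for i in range(len(views))
]

print(gains)
print("fastest growth:", fastest)
print("best per view count:", dict(zip(views, best_at_each)))
```

On these numbers GSP gains the most SR between 120 and 360 target views, while our model has the highest SR at every view count.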

In all the experiments, when a static collision occurs, the navigation agent makes a new action decision, which may either guide the agent out of the dilemma or leave it in place until it runs out of time, e.g., steps. We evaluate the collision avoidance capability of these models by computing the ratio of collision actions as the navigation proceeds, based on the navigation tasks from unseen environments with depth inputs. As shown in Figure 6, TD-A3C has the worst performance. This may be due to the entropy regularisation penalty during training [50], which improves the exploration ability of the model but leads to less attention on the obstacles in the environments. TD-A3C-IL, G-LSTM-A3C-IL, and Ours present better static collision avoidance than TD-A3C, GSP, and Ours-Pre. We attribute this to the premature collision prediction module: all three models are equipped with it, and it encourages the agent to learn to sense static obstacles before acting.
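A curve of this kind can be computed by grouping the actions of all tasks into consecutive step intervals and taking the share of collision actions in each interval. The sketch below is one possible way to do this, not our evaluation script: the bin width of five steps is an assumption suggested by the interval (30, 35] in the caption of Figure 6, and the per-step collision flags are hypothetical inputs.

```python
from typing import List, Sequence


def collision_percentages(
    episodes: Sequence[Sequence[bool]], bin_size: int = 5
) -> List[float]:
    """Percentage of collision actions in each interval of `bin_size` steps.

    episodes[i][t] is True when step t + 1 of task i was a collision.
    Tasks may differ in length; a bin only counts steps that were taken.
    bin_size=5 is an assumed value, not one stated in the text.
    """
    horizon = max(len(flags) for flags in episodes)
    n_bins = (horizon + bin_size - 1) // bin_size
    hits = [0] * n_bins
    steps = [0] * n_bins
    for flags in episodes:
        for t, collided in enumerate(flags):
            b = t // bin_size  # steps 1..5 -> bin 0, steps 6..10 -> bin 1, ...
            steps[b] += 1
            hits[b] += int(collided)
    return [100.0 * h / s for h, s in zip(hits, steps)]


# Two hypothetical tasks of different lengths.
demo = [
    [False, True, False, False, False, False, True],
    [True, True, False],
]
print(collision_percentages(demo))
```

Averaging such curves over several runs would give the mean and the error bands of the kind shown in Figure 6.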

Fig. 6: The collision action percentages of learning models as the navigation proceeds. We report the average values over five runs with standard deviations shown in error bands. The most notable observation is that the collision action percentage of our method decreases at time step , indicating that fewer collisions occur during the time interval (30, 35].
Fig. 7: The robotics system setup and the real-world scenes for training and testing.

V-C D Navigation in AI2-THOR

We further adapt our method to compare it with LSTM-A3C-KG-A [47], which integrates a 3D knowledge graph and sub-targets into a classic deep reinforcement learning framework to boost navigation performance. The experiment is conducted on all the kitchen rooms of AI2-THOR with the same training/testing split and success criterion as [47]. We randomly choose navigation tasks from the training/testing split, and all the initial locations are at least steps away from the targets. The performance of the two methods is presented in Table IV. The results indicate that both methods are prone to over-fitting. However, when evaluated on unseen scenes, our method improves the average SR from 41.09% to 47.32% and the average SPL from 7.20% to 22.09% compared to LSTM-A3C-KG-A. We suggest that our variational generative module, which can infer useful information from the perceptible environment, facilitates generalizable learning for navigation.

Model                  Seen scenes,     Seen scenes,      Unseen scenes
                       Seen targets     Unseen targets
LSTM-A3C-KG-A [47]     98.44 / 52.58    44.25 / 14.89     41.09 / 7.20
Ours                   91.58 / 75.16    86.33 / 67.04     47.32 / 22.09
TABLE IV: Average navigation performance (SR / SPL in %) comparison on AI2-THOR with RGB input.

V-D D Navigation in Real-World Indoor Scenes

We have evaluated our approach on a real-world dataset thus far. To validate the generalization to real-world settings, we employ a mobile robot, TurtleBot, equipped with four onboard monocular cameras for sensing RGB images. In the TurtleBot settings, the action space is also set to be consistent with by using of velocity control. The move action is approximately translation and the rotate action is approximately degrees of rotation. The move right/left action is complex due to the movement direction restrictions of TurtleBot. For example, we use a series of combinatorial motions, to produce the move right action in . We first use our robot to collect data from three training scenes in an academic building in the same way as [17] (see Figure 7, the dataset will be made publicly available), and then transfer our navigation model from AVD to the three scenes. The motivation for this setup is to help the agent become familiar with the general layouts of office or laboratory environments, and weaken the effect of robot type as well. We also test the robot in three more never-before-encountered office scenes for both the cross-target and cross-scene generalization evaluation.

Fig. 8: Visualization of the three trajectories of a TurtleBot to reach targets (in green) from the start images (in orange). The robot manages to reach the locations (in cyan) near the targets in the first two navigation tasks and fails in the scene with many repetitions (e.g., doors in a corridor).

We first evaluate the cross-target generalizations of navigation models in the same training scenes. We start the TurtleBot from different starting locations and orientations and set up navigation tasks. In addition, we test the robot in another three different office scenes that it has never encountered before for both the cross-target and cross-scene generalization evaluation by randomly setting up another navigation tasks. We judge the navigation to be successful if the robot stops near the target, and consider it a failure if the robot collides with an obstacle or does not reach the goal within steps. The performance is measured by the success rate over all navigation tasks. We also choose four tasks from the evaluation and show the results when using different models. As shown in Table V, TD-A3C-IL and G-LSTM-A3C-IL both struggle to generalize to unseen scenes using RGB, due to the high complexity of real indoor environments. The performances on both training scenes and testing scenes of GSP and our model are similar to the evaluation results on AVD. In general, our model outperforms these methods in generalizing to new targets or new scenes with novel layouts.
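The success criterion above can be written down as a small labelling routine that produces the kinds of entries shown in Table V ("Collide", "Fail", or the number of steps taken). This is a sketch under stated assumptions: the `Run` record is hypothetical, the step limit is passed in as a parameter because its value is not restated here, and treating the limit as inclusive is our assumption.

```python
from typing import NamedTuple, Sequence


class Run(NamedTuple):
    """Outcome of one real-world navigation task (hypothetical record)."""
    steps: int
    collided: bool
    stopped_near_target: bool


def outcome(run: Run, max_steps: int) -> str:
    """Label a run in the style of Table V."""
    if run.collided:
        return "Collide"
    if run.stopped_near_target and run.steps <= max_steps:
        return f"{run.steps} Steps"
    return "Fail"


def real_world_sr(runs: Sequence[Run], max_steps: int) -> float:
    """Success rate in percent over all runs."""
    successes = sum(outcome(r, max_steps).endswith("Steps") for r in runs)
    return 100.0 * successes / len(runs)


# Hypothetical runs with a placeholder step limit of 100.
runs = [
    Run(steps=51, collided=False, stopped_near_target=True),
    Run(steps=20, collided=True, stopped_near_target=False),
    Run(steps=100, collided=False, stopped_near_target=False),
    Run(steps=38, collided=False, stopped_near_target=True),
]
print([outcome(r, max_steps=100) for r in runs], real_world_sr(runs, max_steps=100))
```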

                      Three training scenes          Three testing scenes
Model                 Task-1      Task-2      SR     Task-1     Task-2      SR
TD-A3C-IL [3]         Collide     52 Steps    44.0   Collide    Fail        4.0
G-LSTM-A3C-IL [23]    Fail        72 Steps    40.0   Fail       Fail        6.0
GSP [15]              65 Steps    41 Steps    58.0   Collide    Fail        22.0
Ours                  51 Steps    38 Steps    62.0   Fail       51 Steps    30.0
TABLE V: Average navigation performance comparisons in real world with RGB input (SR in %).

In addition, we observe that our model achieves great performance when the target view contains some distinct objects, which happen to be in the front view of the start location. However, there is a high probability that the agent gets stuck in the corner and thrashes around in space without making progress when the initial view of the agent and the target image have no overlap. See Figure 8 for three front-view trajectories of TurtleBot generated by our method. Moreover, the robot exhibits significantly oscillatory and jerky motions during navigation since our method and all alternatives only predict discrete action commands. The testing time of our model for each navigation decision is about . However, the time for the robot to finish each locomotion is much longer. For example, the move right action in is converted to rotate right at for , move forward at for , and rotate left at for . The saltatorial velocity control results in jerky motions. Extension to continuous velocity control would make the method applicable in realistic environments. Videos are available in the supplementary material.
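The conversion of a discrete move right action into three velocity-control segments can be sketched as below. All numeric values are placeholders chosen for illustration, since the speeds and durations are not restated here, and the 90-degree turn is our assumption about how a sideways step is realized on a robot that cannot translate laterally. The `VelocityCommand` type is hypothetical and does not correspond to a particular robot API.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VelocityCommand:
    linear: float    # m/s, forward positive
    angular: float   # rad/s, counter-clockwise positive
    duration: float  # seconds


def move_right(
    distance: float, linear_speed: float, angular_speed: float
) -> List[VelocityCommand]:
    """Expand one discrete 'move right' into rotate right, forward, rotate left."""
    if distance <= 0 or linear_speed <= 0 or angular_speed <= 0:
        raise ValueError("distance and speeds must be positive")
    turn_time = (math.pi / 2) / angular_speed  # time for an assumed 90-degree turn
    return [
        VelocityCommand(0.0, -angular_speed, turn_time),            # rotate right
        VelocityCommand(linear_speed, 0.0, distance / linear_speed),  # move forward
        VelocityCommand(0.0, angular_speed, turn_time),             # rotate left
    ]


def net_heading_change(commands: List[VelocityCommand]) -> float:
    """Total heading change in radians after executing all commands."""
    return sum(c.angular * c.duration for c in commands)


def total_duration(commands: List[VelocityCommand]) -> float:
    return sum(c.duration for c in commands)


# Placeholder values only: 0.3 m sideways at 0.2 m/s and 0.5 rad/s.
cmds = move_right(distance=0.3, linear_speed=0.2, angular_speed=0.5)
print(f"{total_duration(cmds):.2f} s for one move right, "
      f"net heading change {net_heading_change(cmds):.2f} rad")
```

With these placeholder values a single sideways step already takes several seconds, which illustrates why the time to execute each locomotion is much longer than the time to make each navigation decision.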

VI Conclusions and future work

In this work, we present a navigation pipeline for target-driven visual navigation, which does not rely on any maps or localization services at runtime and is purely based on the visual input and a target image. In contrast to most learning-based navigation methods, we design a generative module before predictive navigation control. The key idea is to transfer the multi-modality in visual navigation control to the intermediate generative process, which is dealt with in a variational model. This transfer strengthens the connection between visual observations and navigation actions, thus improving the learning capability of our agent and leading to better generalization to new targets or scenes. In addition, we investigate three techniques to facilitate navigation, which further improves both the cross-scene and cross-target generalization of our agent in the real world.

One limitation of the current framework is that it does not attend to relevant areas of the visual input; doing so could allow the system to detect useful information for decision making more rapidly. Our future work will explore perceptual control in feature space during navigation learning, and will evaluate it in more complex environments.


  2. , where is for the collisions in Table I


  1. S. M. LaValle, Planning algorithms.   Cambridge university press, 2006.
  2. A. Dame and E. Marchand, “A new information theoretic approach for appearance-based navigation of non-holonomic vehicle,” in 2011 IEEE International Conference on Robotics and Automation, 2011, pp. 2459–2464.
  3. Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in Proc. ICRA, 2017, pp. 3357–3364.
  4. T. Chen, S. Gupta, and A. Gupta, “Learning exploration policies for navigation,” in 7th International Conference on Learning Representations, ICLR 2019., 2019.
  5. C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Transactions on robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
  6. L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars, “Probabilistic roadmaps for path planning in high-dimensional configuration spaces,” IEEE transactions on Robotics and Automation, vol. 12, no. 4, pp. 566–580, 1996.
  7. S. M. LaValle, J. J. Kuffner, B. Donald, et al., “Rapidly-exploring random trees: Progress and prospects,” Algorithmic and computational robotics: new directions, no. 5, pp. 293–308, 2001.
  8. D. A. Pomerleau, Neural Network Perception for Mobile Robot Guidance.   Norwell, MA, USA: Kluwer Academic Publishers, 1993.
  9. Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp, “Off-road obstacle avoidance through end-to-end learning,” in Proceedings of the 18th International Conference on Neural Information Processing Systems, ser. NIPS’05.   MIT Press, 2005, pp. 739–746.
  10. D. Watkins-Valls, J. Xu, N. Waytowich, and P. Allen, “Learning your way without map or compass: Panoramic target driven visual navigation,” arXiv preprint arXiv:1909.09295, 2019.
  11. T. Fan, X. Cheng, J. Pan, D. Manocha, and R. Yang, “Crowdmove: Autonomous mapless navigation in crowded scenarios,” arXiv preprint arXiv:1807.07870, 2018.
  12. D. Gordon, A. Kadian, D. Parikh, J. Hoffman, and D. Batra, “Splitnet: Sim2sim and task2task transfer for embodied visual navigation,” International Conference on Computer Vision, pp. 1022–1031, 2019.
  13. T. Fan, P. Long, W. Liu, J. Pan, R. Yang, and D. Manocha, “Learning resilient behaviors for navigation under uncertainty,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 5299–5305.
  14. A. J. Sathyamoorthy, J. Liang, U. Patel, T. Guan, R. Chandra, and D. Manocha, “Densecavoid: Real-time navigation in dense crowds using anticipatory behaviors,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 11 345–11 352.
  15. D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell, “Zero-shot visual imitation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2050–2053.
  16. Q. Wu, D. Manocha, J. Wang, and K. Xu, “Neonav: Improving the generalization of visual navigation via generating next expected observations,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020.   AAAI Press, 2020, pp. 10 001–10 008.
  17. A. Mousavian, A. Toshev, M. Fišer, J. Košecká, A. Wahid, and J. Davidson, “Visual representations for semantic target driven navigation,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 8846–8852.
  18. M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi, “Learning to learn how to learn: Self-adaptive visual navigation using meta-learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6750–6759.
  19. H.-T. L. Chiang, A. Faust, M. Fiser, and A. Francis, “Learning navigation behaviors end-to-end with autorl,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 2007–2014, 2019.
  20. Y. Chen, C. Liu, B. E. Shi, and M. Liu, “Robot navigation in crowds by graph convolutional networks with attention learned from human gaze,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2754–2761, 2020.
  21. X. Ye, Z. Lin, H. Li, S. Zheng, and Y. Yang, “Active object perceiver: Recognition-guided policy learning for object searching on mobile robots,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 6857–6863.
  22. P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al., “Learning to navigate in complex environments,” arXiv preprint arXiv:1611.03673, 2016.
  23. Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian, “Building generalizable agents with a realistic and rich 3d environment,” in 6th International Conference on Learning Representations, ICLR 2018, Workshop Track Proceedings, 2018.
  24. J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, “Deep reinforcement learning with successor features for navigation across similar environments,” in Proc. IROS, 2017, pp. 2371–2378.
  25. A. Devo, G. Mezzetti, G. Costante, M. L. Fravolini, and P. Valigi, “Towards generalization in target-driven visual navigation by using deep reinforcement learning,” IEEE Transactions on Robotics, 2020.
  26. P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al., “On evaluation of embodied navigation agents,” arXiv preprint arXiv:1807.06757, 2018.
  27. M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al., “Habitat: A platform for embodied ai research,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9339–9347.
  28. S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with large-scale data collection,” in International symposium on experimental robotics.   Springer, 2016, pp. 173–184.
  29. Y. Gao, H. Xu, J. Lin, F. Yu, S. Levine, and T. Darrell, “Reinforcement learning from imperfect demonstrations,” in 6th International Conference on Learning Representations, ICLR 2018, Workshop Track Proceedings, 2018.
  30. S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 627–635.
  31. J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances in neural information processing systems, 2016, pp. 4565–4573.
  32. B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, D. Fox and C. P. Gomes, Eds.   AAAI Press, 2008, pp. 1433–1438.
  33. C. You, J. Lu, D. P. Filev, and P. Tsiotras, “Advanced planning for autonomous vehicles using reinforcement learning and deep inverse reinforcement learning,” Robotics Auton. Syst., vol. 114, pp. 1–18, 2019.
  34. C. Xia and A. El Kamel, “Neural inverse reinforcement learning in autonomous navigation,” Robotics and Autonomous Systems, vol. 84, pp. 1–14, 2016.
  35. S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive mapping and planning for visual navigation,” in Proc. CVPR, 2017, pp. 2616–2625.
  36. B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and autonomous systems, vol. 57, no. 5, pp. 469–483, 2009.
  37. L. Lind, “Deep learning navigation for ugvs on forests paths,” 2018.
  38. C. Richter and N. Roy, “Safe visual navigation via deep learning and novelty detection,” Robotics: Science and Systems Foundation, 2017.
  39. F. Codevilla, M. Müller, A. M. López, V. Koltun, and A. Dosovitskiy, “End-to-end driving via conditional imitation learning,” in 2018 IEEE International Conference on Robotics and Automation, ICRA 2018.   IEEE, 2018, pp. 1–9.
  40. J. Xu, Q. Liu, H. Guo, A. Kageza, S. AlQarni, and S. Wu, “Shared multi-task imitation learning for indoor self-navigation,” in IEEE Global Communications Conference, GLOBECOM 2018.   IEEE, 2018, pp. 1–7.
  41. M. Pfeiffer, S. Shukla, M. Turchetta, C. Cadena, A. Krause, R. Siegwart, and J. Nieto, “Reinforced imitation: Sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4423–4430, 2018.
  42. P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to poke by poking: Experiential learning of intuitive physics,” in Advances in neural information processing systems, 2016, pp. 5074–5082.
  43. L. Buesing, T. Weber, S. Racaniere, S. Eslami, D. Rezende, D. P. Reichert, F. Viola, F. Besse, K. Gregor, D. Hassabis, et al., “Learning and querying fast generative models for reinforcement learning,” arXiv preprint arXiv:1802.03006, 2018.
  44. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  45. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
  46. T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in 6th International Conference on Learning Representations, ICLR 2018, Conference Track Proceedings., 2018.
  47. Y. Lv, N. Xie, Y. Shi, Z. Wang, and H. T. Shen, “Improving target-driven visual navigation with attention on 3d spatial relationships,” arXiv preprint arXiv:2005.02153, 2020.
  48. T. Tieleman and G. Hinton, “Rmsprop gradient optimization,” URL http://www.cs.toronto.edu/tijmen/csc321/slides/lecture_slides_lec6.pdf, 2014.
  49. W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, “Visual semantic navigation using scene priors,” in 7th International Conference on Learning Representations, ICLR 2019, 2019.
  50. M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” in 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings, 2017.