Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training


Learning to navigate in a visual environment following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable, and the training data on a new task is often limited. In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions. It can be easily used as a drop-in for existing VLN frameworks, leading to the proposed agent Prevalent¹. It learns more effectively in new tasks and generalizes better in a previously unseen environment. The performance is validated on three VLN tasks. On the Room-to-Room [3] benchmark, our model improves the state of the art from 47% to 51% on success rate weighted by path length. Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation [30] and “Help, Anna!” [22], the proposed Prevalent leads to significant improvement over existing methods, achieving a new state of the art.


1 Introduction

Learning to navigate in a photorealistic home environment based on natural language instructions has attracted increasing research interest [23, 14, 7, 3, 6], as it provides insight into core scientific questions about multimodal representations. It also takes a step toward real-world applications, such as personal assistants and in-home robots. Vision-and-language navigation (VLN) presents a challenging reasoning problem for agents, as the multimodal inputs are highly variable, inherently ambiguous, and often under-specified.

Most previous methods build on the sequence-to-sequence architecture [26], where the instruction is encoded as a sequence of words, and the navigation trajectory is decoded as a sequence of actions, enhanced with attention mechanisms [3, 32, 18] and beam search [9]. While a number of methods [20, 21, 33] have been proposed to improve language understanding, common to all existing work is that the agent learns to understand each instruction from scratch or in isolation, without collectively leveraging prior vision-grounded domain knowledge.

However, each instruction in practice only loosely aligns with the desired navigation path, making it imperfect for the existing paradigm of learning to understand the instruction from scratch. This is because every instruction only partially characterizes the trajectory, so the instructions can be ambiguous to interpret without grounding in the visual states. The objects in visual states and language instructions may share various common forms/relationships, and therefore it is natural to build an informative joint representation beforehand, and use this “common knowledge” for transfer learning in downstream tasks.

Figure 1: Illustration of the proposed pre-training & fine-tuning paradigm for VLN. The image-text-action triplets are collected from the R2R dataset. The model is pre-trained with two self-supervised learning objectives, and fine-tuned for three tasks: R2R, CVDN and HANNA. R2R is an in-domain task, where the language instruction is given at the beginning, describing the full navigation path. CVDN and HANNA are out-of-domain tasks: the former is to navigate based on dialog history, while the latter is an interactive environment, where intermediate instructions are given in the middle of navigation.

To address this natural ambiguity of instructions more effectively, we propose to pre-train an encoder to align language instructions and visual states for joint representations. The image-text-action triplets at each time step are independently fed into the model, which is trained to predict the masked word tokens and next actions, thus formulating the VLN pre-training in the self-supervised learning paradigm. The complexity of VLN learning can then be reduced by eliminating interpretations of the language that are inconsistent with the visual states. The pre-trained model plays the role of providing generic image-text representations, and is applicable to most existing approaches to VLN, leading to our agent Prevalent. We consider three VLN scenarios as downstream tasks: Room-to-Room (R2R) [3], cooperative vision-and-dialog navigation (CVDN) [30], and “Help, Anna!” (HANNA) [22]. The overall pre-training & fine-tuning pipeline is shown in Figure 1.

Comprehensive experiments demonstrate strong empirical performance of Prevalent. The proposed Prevalent achieves a new state of the art on all three tasks². Compared with existing methods, it adapts faster, and generalizes better to unseen environments and new tasks. Our code and pre-trained model are released on GitHub³.

2 Related Work

Vision-language pre-training

Vision-Language Pre-training (VLP) is a rapidly growing research area. The existing approaches employ BERT-like objectives [8] to learn cross-modal representations for various vision-language problems, such as visual question answering, image-text retrieval and image captioning [25, 27, 17, 34, 24, 15]. However, these VLP works focus on learning representations only for vision-language domains. This paper presents the first pre-trained models that ground vision-language understanding with actions in a reinforcement learning setting. Further, existing VLP methods require Faster R-CNN features as visual inputs [10, 2], which are not readily applicable to VLN. State-of-the-art VLN systems are based on panoramic views (e.g., 36 images per view for R2R), and therefore it is computationally infeasible to extract region features for all views and feed them into the agent.

Vision-and-language navigation

Various methods have been proposed for learning to navigate based on vision-language cues. In [9] a panoramic action space and a “speaker” model were introduced for data augmentation. A novel neural decoding scheme with search was proposed in [12], to balance global and local information. To improve the alignment of the instruction and visual scenes, a visual-textual co-grounding attention mechanism was proposed in [18], which is further improved with a progress monitor [19]. To improve the generalization of the learned policy to unseen environments, reinforcement learning has been considered, including planning [33], and exploration of unseen environments using an off-policy method [32]. An environment dropout was proposed [28] to generate more environments based on the limited data, so that the agent can generalize well to unseen environments. These methods are specifically designed for particular tasks, and are hard to generalize to new tasks. In this paper, we propose the first generic agent that is pre-trained to effectively understand vision-language inputs for a broad range of navigation tasks, and can quickly adapt to new tasks. The most related agent to ours is PreSS [16]. However, our work differs from [16] in two respects: (i) PreSS employs an off-the-shelf BERT [8] model for language instruction understanding, while we pre-train a vision-language encoder from scratch, specifically for the navigation tasks; (ii) PreSS only focuses on the R2R task, while we verify the effectiveness of our pre-trained model on three tasks, including two out-of-domain navigation tasks.

3 Background

The VLN task can be formulated as a Partially Observable Markov Decision Process (POMDP) $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P_s, r \rangle$, where $\mathcal{S}$ is the visual state space, $\mathcal{A}$ is a discrete action space, $P_s$ is the unknown environment distribution from which we draw the next state, and $r$ is the reward function. At each time step $t$, the agent first observes an RGB image $s_t \in \mathcal{S}$, and then takes an action $a_t \in \mathcal{A}$. This leads the simulator to generate a new image observation $s_{t+1} \sim P_s(\cdot \mid s_t, a_t)$ as the next state. The agent interacts with the environment sequentially, and generates a trajectory $\tau$ of length $T$. The episode ends when the agent selects the special STOP action, or when a pre-defined maximum trajectory length is reached. The navigation is successfully completed if the trajectory terminates at the intended target location.

In a typical VLN setting, the instructions are represented as a set $\mathcal{X} = \{x_i\}_{i=1}^{M}$, where $M$ is the number of alternative instructions, and each instruction $x_i$ consists of a sequence of $L_i$ word tokens, $x_i = [x_{i,1}, x_{i,2}, \dots, x_{i,L_i}]$. The training dataset $\mathcal{D}_E = \{(\tau, x)\}$ consists of pairs of the instruction $x$ together with its corresponding expert trajectory $\tau$. The agent then learns to navigate via performing maximum likelihood estimation (MLE) of the policy $\pi$, based on the individual sequences:

$$\max_{\theta} \; \mathcal{L}_{\theta}(\tau, x) = \log \pi_{\theta}(\tau \mid x) = \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t, x),$$

where $\theta$ are the policy parameters. The policy is usually parameterized as an attention-based Seq2Seq model [3, 9], trained in the teacher-forcing fashion, i.e., the ground-truth states $s_t$ are provided at every step in training. This allows reparameterization of the policy as an encoder-decoder architecture, by considering a function decomposition $\pi_{\theta} = f_{\theta_D} \circ f_{\theta_E}$:

  • A vision-language encoder $f_{\theta_E}: \{s_t, x\} \rightarrow z_t$, where a joint representation $z_t$ at time step $t$ is learned over the visual state $s_t$ and the language instruction $x$.

  • An action decoder $f_{\theta_D}: z_t \rightarrow a_t$. For each joint representation $z_t$, we ground it with the visual state $s_t$ via neural attention, and decode into actions $a_t$.

Successful navigation largely depends on precise joint understanding of natural language instructions and the visual states [29]. We isolate the encoder stage, and focus on pre-training a generic vision-language encoder for various navigation tasks.
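As a rough illustration of the teacher-forced MLE objective in this section, the Python sketch below sums the log-probabilities of the expert actions along one trajectory. The `uniform_policy` stand-in, the string-valued states, and the example instruction are illustrative assumptions, not the model used in this paper.

```python
import math


def trajectory_log_likelihood(policy, states, actions, instruction):
    """Return sum_t log pi(a_t | s_t, x) for one expert trajectory.

    Under teacher forcing the ground-truth state s_t is supplied at every
    step, so each term depends only on (s_t, x) and the expert action a_t.
    """
    total = 0.0
    for state, action in zip(states, actions):
        probs = policy(state, instruction)  # distribution over actions
        total += math.log(probs[action])
    return total


def uniform_policy(state, instruction, num_actions=4):
    """Hypothetical stand-in policy: ignores its inputs, uniform over actions."""
    return [1.0 / num_actions] * num_actions


# Toy expert trajectory with three steps and expert action indices 2, 0, 3.
states = ["view_0", "view_1", "view_2"]
actions = [2, 0, 3]
log_lik = trajectory_log_likelihood(
    uniform_policy, states, actions, "walk to the kitchen"
)
```

Maximizing this quantity over the policy parameters corresponds to the MLE objective; in practice the policy would be the encoder-decoder model rather than a fixed distribution.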

4 Pre-training Models

Our pre-training model aims to provide joint representations for image-text inputs in VLN.

4.1 Input Embeddings

The input embedding layers convert the inputs (i.e., panoramic views and language instruction) into two sequences of features: image-level visual embeddings and word-level sentence embeddings.

Visual Embedding

Following [9], we employ panoramic views as visual inputs to the agent. Each panoramic view consists of 36 images in total (12 angles, and 3 camera poses per angle): $s = [s_1, \dots, s_{36}]$. Each image is represented as a 2176-dimensional feature vector $s_i = [s_v; s_p]$, as a result of the concatenation of two vectors: (i) the 2048-dimensional visual feature $s_v$ output by a Residual Network (ResNet) [11] for the image; (ii) the 128-dimensional orientation feature vector $s_p$ that repeats $[\sin\psi; \cos\psi; \sin\omega; \cos\omega]$ 32 times, where $\psi$ and $\omega$ are the heading and elevation poses, respectively [9]. The embedding for each image is:

$$v_i = \mathrm{LN}(W_e s_i + b_e),$$

where $W_e \in \mathbb{R}^{d_h \times 2176}$ is a weight matrix, and $b_e \in \mathbb{R}^{d_h}$ is the bias term, with $d_h$ denoting the hidden dimension used in our experiments. Layer normalization (LN) [4] is used on the output of this fully connected (FC) layer. An illustration of the visual embedding is shown in Figure 2(a).

Text Embedding

The embedding layer for the language instruction follows the standard Transformer, where LN is applied to the summation of the token embedding and position embedding. An illustration of the text embedding is shown in Figure 2(b).
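The two embedding layers described in this subsection can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the layer normalization omits learned gain and bias, and the sizes `HIDDEN`, `VOCAB` and `MAX_LEN` as well as the random parameter tables are toy values chosen for illustration, not the settings of this paper.

```python
import math
import random

RESNET_DIM = 2048     # ResNet feature size stated in the text
ORIENT_REPEAT = 32    # [sin, cos, sin, cos] tiled 32 times gives 128 dims
IMAGE_DIM = RESNET_DIM + 4 * ORIENT_REPEAT  # 2176


def layer_norm(x, eps=1e-12):
    """Layer normalization without learned gain and bias (a simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def orientation_feature(heading, elevation):
    """Tile [sin(heading), cos(heading), sin(elevation), cos(elevation)] 32 times."""
    base = [math.sin(heading), math.cos(heading),
            math.sin(elevation), math.cos(elevation)]
    return base * ORIENT_REPEAT


def visual_embedding(resnet_feat, heading, elevation, weight, bias):
    """LN(W [visual; orientation] + b) for one image of a panoramic view."""
    x = list(resnet_feat) + orientation_feature(heading, elevation)
    if len(x) != IMAGE_DIM:
        raise ValueError("expected a 2048-dimensional ResNet feature")
    projected = [sum(w * v for w, v in zip(row, x)) + b
                 for row, b in zip(weight, bias)]
    return layer_norm(projected)


def text_embedding(token_ids, token_table, position_table):
    """LN(token embedding + position embedding) for each word position."""
    return [
        layer_norm([t + p for t, p in zip(token_table[tok], position_table[pos])])
        for pos, tok in enumerate(token_ids)
    ]


# Toy usage; HIDDEN, VOCAB and MAX_LEN are illustrative, not the paper's sizes.
HIDDEN, VOCAB, MAX_LEN = 8, 10, 5
rng = random.Random(0)
W = [[rng.gauss(0.0, 0.02) for _ in range(IMAGE_DIM)] for _ in range(HIDDEN)]
b = [0.0] * HIDDEN
resnet_feat = [rng.random() for _ in range(RESNET_DIM)]
image_vec = visual_embedding(resnet_feat, math.pi / 6, 0.0, W, b)

token_table = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(VOCAB)]
position_table = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(MAX_LEN)]
word_vecs = text_embedding([3, 7, 1], token_table, position_table)
```

Both functions return vectors of the same hidden size, which is what allows the image and word sequences to be processed by Transformer layers of a common width.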

(a) Visual embedding     (b) Text embedding
Figure 2: Illustration of the representation procedure of (a) visual embedding and (b) text embedding. FC is the fully-connected layer, and LN is the layer-normalization layer.

4.2 Encoder Architecture

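Our backbone network has three principal modules, including two single-modal encoders (one for each modality), followed by a cross-modal encoder. All modules are based on a multi-layer Transformer [31]. For the $\ell$-th Transformer layer, its output is

$$H_{\ell} = \mathcal{T}(H_{\ell-1}, H', M),$$

where $H_{\ell-1} \in \mathbb{R}^{L \times d_h}$ is the previous layer’s features ($L$ is the sequence length), $H' \in \mathbb{R}^{L' \times d_h}$ is the feature matrix to attend, and $M \in \mathbb{R}^{L \times L'}$ is the mask matrix, determining whether a pair of tokens can be attended to each other. More specifically, in each Transformer block, the output vector is the concatenation of multiple attention heads $[A_1, \dots, A_h]$ ($h$ is the number of heads). One attention head $A$ is computed via:

$$A = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V, \qquad Q = H_{\ell-1} W^{Q}, \quad K = H' W^{K}, \quad V = H' W^{V},$$

where $H_{\ell-1}$ and $H'$ are linearly projected to a triple of queries, keys and values using parameter matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h \times d_k}$, respectively; $d_k$ is the projection dimension. An entry $M_{ij} = 0$ allows position $i$ to attend to position $j$, while $M_{ij} = -\infty$ prevents it. In the following, we use different mask matrices $M$ and attended feature matrices $H'$ to construct the contextualized representation for each module.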

Single-modal Encoder

The standard self-attention layer is used in the single-modal encoder. All of the keys, values and queries come from the output of the previous layer in the encoder. Each position in the encoder can attend to all positions that belong to its own modality in the previous layer. Specifically, the mask matrix $M$ is a full-zero matrix, and the attended feature matrix $H'$ equals the previous layer's output $H_{\ell-1}$. Similar to the self-attention encoder module in the standard Transformer, the position-wise feed-forward network (FFN) is used.

Cross-modal Encoder

To fuse the features from both modalities, the cross-attention layer is considered. The queries come from the previous layer of the other modality, and the memory keys and values come from the output of the current modality. It allows every position in the encoder to attend over all positions in the different modality. This mimics the typical encoder-decoder attention mechanisms in the Transformer, but here we consider two different modalities, rather than input-output sequences. This cross-attention layer is followed by a self-attention layer and an FFN layer.
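The attention head used by both encoder types can be sketched in plain Python as below. This is an illustrative sketch, not the released implementation: it covers a single head without the output projection, FFN, or residual connections, and the identity projection matrices and toy feature values are assumptions made for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax; entries of -inf receive zero weight."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def attention_head(h_prev, h_attend, mask, w_q, w_k, w_v):
    """Softmax(Q K^T / sqrt(d_k) + M) V for a single head.

    h_prev provides the queries; h_attend provides the keys and values.
    Returns the head output and the attention weights.
    """
    q = matmul(h_prev, w_q)
    k = matmul(h_attend, w_k)
    v = matmul(h_attend, w_v)
    d_k = len(w_q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T
    weights = [
        softmax([s / math.sqrt(d_k) + m for s, m in zip(s_row, m_row)])
        for s_row, m_row in zip(scores, mask)
    ]
    return matmul(weights, v), weights


# Toy features: 2 word positions and 3 image positions, hidden size 2.
identity = [[1.0, 0.0], [0.0, 1.0]]          # illustrative projection matrices
text_feats = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

# Single-modal encoder: all-zero mask, attended features equal the layer input.
self_out, self_w = attention_head(
    text_feats, text_feats, [[0.0, 0.0], [0.0, 0.0]], identity, identity, identity
)

# Cross-modal encoder: queries from one modality, keys/values from the other.
cross_out, cross_w = attention_head(
    text_feats, image_feats, [[0.0] * 3, [0.0] * 3], identity, identity, identity
)
```

Setting a mask entry to `float("-inf")` removes the corresponding key from the softmax, which is how the mask matrix controls which token pairs may attend to each other.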

The overall model architecture is illustrated in Figure 3. Following [27], , and . The last layer output of the encoder is denoted as , which is used as the features in the downstream tasks.

Figure 3: Illustration of the proposed pre-training model. In this example, two learning objectives are considered: (i) image-attended masked language modeling is performed on the masked word in the instruction; (ii) action prediction is performed to decide the direction in which to navigate. Only the language features are used for fine-tuning in downstream tasks.

4.3 Pre-training Objectives

We introduce two main tasks to pre-train our model: image-attended masked language modeling (MLM) and action prediction (AP). For an instruction-trajectory pair $(x, \tau)$ from the training dataset $\mathcal{D}_E$, we assume that a state-action pair $(s_t, a_t)$ from the trajectory is independently and identically distributed given the instruction in the pre-training stage: $(s_t, a_t) \overset{\text{iid}}{\sim} p(\tau)$.

Attended Masked Language Modeling

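We randomly mask out the input words with a fixed probability, and replace the masked ones with the special token [MASK]. The goal is to predict each masked word $x_i$ based on the observation of its surrounding words $x_{\setminus i}$ and all images $s$ by minimizing the negative log-likelihood:

$$\mathcal{L}_{\mathrm{MLM}} = -\mathbb{E}_{s \sim p(\tau),\, (\tau, x) \sim \mathcal{D}_E} \log p(x_i \mid x_{\setminus i}, s).$$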

This is in analogy to the cloze task in BERT, where the masked word is recovered from surrounding words, but with additional image information to attend. It helps the learned word embeddings to be grounded in the context of visual states. This is particularly important for VLN tasks, where the agent is required to monitor the progress of completed instruction by understanding the visual images.
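The masking step and the negative log-likelihood can be sketched as follows. This is a minimal sketch: `mask_prob=0.3` in the usage lines is an arbitrary illustrative value, and the `predicted` dictionary stands in for the distribution that the image-attended encoder would output at each masked position.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Replace each token by [MASK] with probability mask_prob.

    Returns the corrupted sequence and a {position: original token} map
    holding the prediction targets.
    """
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets


def mlm_loss(predicted_probs, targets):
    """Mean negative log-likelihood of the original tokens at masked positions.

    predicted_probs[i] maps candidate tokens to probabilities at position i;
    in the real model these come from attending to words and images.
    """
    if not targets:
        return 0.0
    nll = [-math.log(predicted_probs[i][tok]) for i, tok in targets.items()]
    return sum(nll) / len(nll)


# Toy usage with a hypothetical instruction and stand-in predictions.
tokens = "walk past the sofa and stop".split()
rng = random.Random(0)
corrupted, targets = mask_tokens(tokens, mask_prob=0.3, rng=rng)
predicted = {i: {tok: 0.5} for i, tok in targets.items()}
loss = mlm_loss(predicted, targets)
```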

Action Prediction

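The output on the special token [CLS] indicates the fused representation of both modalities. We apply an FC layer on top of the encoder output of [CLS] to predict the action. It scores how well the agent can make the correct decision conditioned on the current visual image and the instruction, without referring to the trajectory history. During training, we sample a state-action pair $(s, a)$ from the trajectory $\tau$ at each step, and then apply a cross-entropy loss for optimization:

$$\mathcal{L}_{\mathrm{AP}} = -\mathbb{E}_{(a, s) \sim p(\tau),\, (\tau, x) \sim \mathcal{D}_E} \log p(a \mid x_{\texttt{[CLS]}}, s).$$

The full pre-training objective is:

$$\mathcal{L}_{\text{Pre-training}} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{AP}}.$$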

Other loss designs can be considered for the pre-training objective. Our initial results on masked image modeling did not show better results, and thus are excluded in the experiments.
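As a hedged sketch of the action prediction loss and the combined objective described in this subsection, the Python code below applies an FC layer to a [CLS] output vector and computes the cross-entropy of the expert action. The vector values, the number of candidate actions, and the unweighted sum of the two losses are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def action_prediction_loss(cls_output, weight, bias, expert_action):
    """Cross-entropy of the expert action from an FC layer on the [CLS] output."""
    logits = [sum(w * h for w, h in zip(row, cls_output)) + b
              for row, b in zip(weight, bias)]
    return -log_softmax(logits)[expert_action]


def pretraining_loss(mlm_loss, ap_loss):
    """Full objective as an unweighted sum of the two losses (assumed weighting)."""
    return mlm_loss + ap_loss


# Toy usage: a 3-dimensional [CLS] output and 4 candidate actions.
cls_output = [0.2, -0.1, 0.5]
weight = [[0.0, 0.0, 0.0] for _ in range(4)]  # zero weights give uniform logits
bias = [0.0, 0.0, 0.0, 0.0]
ap = action_prediction_loss(cls_output, weight, bias, expert_action=1)
total = pretraining_loss(0.7, ap)
```

With zero weights every action is equally likely, so the loss equals the entropy of a uniform distribution over the candidates; training lowers it by moving probability mass onto the expert action.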

4.4 Pre-training Datasets

We construct our pre-training dataset based on the Matterport3D Simulator, a photo-realistic visual reinforcement learning (RL) simulation environment for the development of intelligent agents based on the Matterport3D dataset [5]. Specifically, it consists of two sets: (i) the training set of R2R, which has 104K image-text-action triplets; (ii) synthesized data, for which we employed the Speaker model in [9] to synthesize 1,020K instructions for the shortest-path trajectories on the training environments, leading to 6,482K image-text-action triplets. Therefore, the pre-training dataset size is 6,582K.

5 Adapting to new tasks

We focus on three downstream VLN tasks that are based on the Matterport3D simulator. Each task poses a very different challenge to evaluate the agent. The R2R task is used as an in-domain task; it can verify the agent’s generalization capability to unseen environments. CVDN and HANNA are considered as out-of-domain tasks, to study the generalization ability of our agent to new tasks. More specifically, CVDN considers indirect instructions (i.e., dialog history), and HANNA is an interactive RL task.

5.1 Room-to-Room

In R2R, the goal is to navigate from a starting position to a target position with the minimal trajectory length, where the target is explicitly described in a language instruction. To use the pre-trained model for fine-tuning in R2R, the attended contextualized word embeddings are fed into an LSTM encoder-decoder framework, as in [9, 16]. In prior work, random initialization is used in [9], and BERT is used in [16]. In contrast, our word embeddings are pre-trained from scratch with VLN data and tasks.

5.2 Cooperative Vision-and-Dialogue Navigation

In the CVDN environment, the Navigation from Dialog History (NDH) task is defined, where the agent searches an environment for a goal location, based on the dialog history that consists of multiple turns of question-answering interactions between the agent and its partner. The partner has privileged access to the best next steps that the agent should take according to a shortest path planner. CVDN is more challenging than R2R, in that the instructions from the dialog history are often ambiguous, under-specified, and indirect to the final target. The fine-tuning model architecture for CVDN is the same as R2R, except that CVDN usually has much longer text input. We limit the sequence length to 300; words beyond this limit in a dialog history are removed.

5.3 HANNA: Interactive Imitation Learning

HANNA simulates a scenario, where a human requester asks an agent via language to find an object in an indoor environment, without specifying the process of how to complete the task. The only source of help the agent can leverage in the environment is the assistant, who helps the agent by giving subtasks in the form of a natural language instruction that guides the agent to a specific location, and an image of the view at that location. When the help mode is triggered, we use our pre-trained model to encode the language instructions, and the features are used for the rest of their system.

6 Experimental Results

6.1 Training details


We pre-train the proposed model on eight V100 GPUs, the batch size for each GPU is 96. The AdamW optimizer [13] is used, and the learning rate is . The total training epochs is 20.


The fine-tuning is performed on NVIDIA 1080Ti GPU. For the R2R task, we follow the same learning schedule as [28]. When training the augmented listener, we use batch size 20. We continue to fine-tune the cross-attention encoder for 20k iterations, with the batch size 10 and learning rate . For the NDH task, we follow the same learning schedule as in [30], and choose the batch size as 15 and learning rate as . For HANNA, the training schedule is the same as [22]. The batch size is 32 and learning rate is .

6.2 Room-to-Room


The R2R dataset [3] consists of 10,800 panoramic views (each panoramic view has 36 images) and 7,189 trajectories. Each trajectory is paired with three natural language instructions. The R2R dataset consists of four splits: train, validation seen, validation unseen, and test unseen. The challenge of R2R is to test the agent’s generalization ability in unseen environments.

Evaluation Metrics

The performance of different agents is evaluated using the following metrics:

  • Trajectory Length (TL) measures the average length of the navigation trajectory.

  • Navigation Error (NE) is the mean of the shortest path distance in meters between the agent’s final location and the target location.

  • Success Rate (SR) is the percentage of episodes in which the agent’s final location is less than 3 meters away from the target location.

  • Success weighted by Path Length (SPL) [1] trades off SR against TL. A higher score represents more efficient navigation.

Among these metrics, SPL is the recommended primary metric, and other metrics are considered as auxiliary measures.
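For concreteness, the metrics above can be sketched in Python as follows, with SPL computed in the form defined in [1]: each successful episode contributes the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The episode values are made up for illustration, and the sketch assumes every shortest-path length is positive.

```python
def navigation_error(final_dists):
    """Mean shortest-path distance (m) from the final location to the target."""
    return sum(final_dists) / len(final_dists)


def success_rate(final_dists, threshold=3.0):
    """Fraction of episodes that end less than `threshold` meters from the target."""
    return sum(1 for d in final_dists if d < threshold) / len(final_dists)


def spl(final_dists, path_lengths, shortest_lengths, threshold=3.0):
    """Success weighted by Path Length.

    Each episode contributes S_i * l_i / max(p_i, l_i), where S_i is the
    success indicator, p_i the agent's path length and l_i the shortest-path
    length (assumed positive).
    """
    total = 0.0
    for d, p, l in zip(final_dists, path_lengths, shortest_lengths):
        success = 1.0 if d < threshold else 0.0
        total += success * l / max(p, l)
    return total / len(final_dists)


# Two illustrative episodes: one success with a detour, one failure.
final_dists = [1.0, 5.0]        # meters from the target at the end
path_lengths = [10.0, 8.0]      # meters actually travelled
shortest_lengths = [8.0, 8.0]   # meters along the shortest path

ne = navigation_error(final_dists)
sr = success_rate(final_dists)
score = spl(final_dists, path_lengths, shortest_lengths)
```

In this toy case the successful episode is penalized for its detour, so SPL is lower than SR, which is why SPL rewards efficient navigation in addition to reaching the target.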

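| Setting | Agent | Val Seen TL | Val Seen NE | Val Seen SR | Val Seen SPL | Val Unseen TL | Val Unseen NE | Val Unseen SR | Val Unseen SPL | Test Unseen TL | Test Unseen NE | Test Unseen SR | Test Unseen SPL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| - | Random | 9.58 | 9.45 | 16 | - | 9.77 | 9.23 | 16 | - | 9.93 | 9.77 | 13 | 12 |
| - | Seq2Seq | 11.33 | 6.01 | 39 | - | 8.39 | 7.81 | 22 | - | 8.13 | 7.85 | 20 | 18 |
| - | RPA | - | 5.56 | 43 | - | - | 7.65 | 25 | - | 9.15 | 7.53 | 25 | 23 |
| Greedy, S | Speaker-Follower | - | 3.36 | 66 | - | - | 6.62 | 35 | - | 14.82 | 6.62 | 35 | 28 |
| Greedy, S | SMNA | - | - | - | - | - | - | - | - | 18.04 | 5.67 | 48 | 35 |
| Greedy, S | RCM+SIL(train) | 10.65 | 3.53 | 67 | - | 11.46 | 6.09 | 43 | - | 11.97 | 6.12 | 43 | 38 |
| Greedy, S | Regretful | - | 3.23 | 69 | 63 | - | 5.32 | 50 | 41 | 13.69 | 5.69 | 48 | 40 |
| Greedy, S | Fast | - | - | - | - | 21.17 | 4.97 | 56 | 43 | 22.08 | 5.14 | 54 | 41 |
| Greedy, S | EnvDrop | 11.00 | 3.99 | 62 | 59 | 10.70 | 5.22 | 52 | 48 | 11.66 | 5.23 | 51 | 47 |
| Greedy, S | Press | 10.57 | 4.39 | 58 | 55 | 10.36 | 5.28 | 49 | 45 | 10.77 | 5.49 | 49 | 45 |
| Greedy, S | Prevalent (ours) | 10.32 | 3.67 | 69 | 65 | 10.19 | 4.71 | 58 | 53 | 10.51 | 5.30 | 54 | 51 |
| M | Press | 10.35 | 3.09 | 71 | 67 | 10.06 | 4.31 | 59 | 55 | 10.52 | 4.53 | 57 | 53 |
| M | Prevalent | 10.31 | 3.31 | 67 | 63 | 9.98 | 4.12 | 60 | 57 | 10.21 | 4.52 | 59 | 56 |
| - | Human | - | - | - | - | - | - | - | - | 11.85 | 1.61 | 86 | 76 |

Table 1: Comparison with the state-of-the-art methods on R2R. TL and NE are in meters; SR and SPL are in percent. Blue indicates best value in a given setting. S indicates the single-instruction setting; M indicates the multiple-instruction setting.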


We compare our approach with a random baseline and nine recently published systems:

  • Random: an agent that randomly selects a direction and moves five steps in that direction [3].

  • S2S-Anderson (Seq2Seq in Table 1): a sequence-to-sequence model using a limited discrete action space [3].

  • RPA [33]: an agent that combines model-free and model-based reinforcement learning, using a look-ahead module for planning.

  • Speaker-Follower [9]: an agent trained with data augmentation from a speaker model on the panoramic action space.

  • SMNA [18]: an agent trained with a visual-textual co-grounding module and a progress monitor on the panoramic action space.

  • RCM+SIL [32]: an agent trained with cross-modal grounding locally and globally via RL.

  • Regretful [19]: an agent with a trained progress monitor heuristic for search that enables backtracking.

  • Fast [12]: an agent that uses a fusion function to score and compare partial trajectories of different lengths, which enables the agent to efficiently backtrack after a mistake.

  • EnvDrop [28]: an agent trained with environment dropout, which can generate more environments based on the limited seen environments.

  • PreSS [16]: an agent trained with pre-trained language models and stochastic sampling to generalize well in unseen environments.

Comparison with SoTA

Table 1 compares the performance of our agent against the existing published top systems⁴. Our agent Prevalent outperforms the existing models on SR and SPL by a large margin. On both validation seen and unseen environments, Prevalent outperforms other agents on nearly all metrics.

In PreSS [16], multiple instructions are used. To have a fair comparison, we follow [16], and report Prevalent results under the same multiple-instruction setting. We see that the test SPL is improved (from 53 to 56 in Table 1). Further, the gap between seen and unseen environments of Prevalent is smaller than that of PreSS, meaning that image-attended language understanding is more effective in helping the agent generalize to unseen environments.

| Agent | Val Unseen: Oracle | Val Unseen: Navigator | Val Unseen: Mixed | Test Unseen: Oracle | Test Unseen: Navigator | Test Unseen: Mixed |
|---|---|---|---|---|---|---|
| Random | 1.09 | 1.09 | 1.09 | 0.83 | 0.83 | 0.83 |
| Seq2Seq | 1.23 | 1.98 | 2.10 | 1.25 | 2.11 | 2.35 |
| Prevalent (Ours) | 2.58 | 2.99 | 3.15 | 1.67 | 2.39 | 2.44 |
| Shortest Path Agent | 8.36 | 7.99 | 9.58 | 8.06 | 8.48 | 9.76 |

Table 2: Results on CVDN measured by Goal Progress. Blue indicates best value in a given setting.

6.3 Cooperative Vision-and-Dialogue Navigation

Dataset & Evaluation Metric

The CVDN dataset has 2050 human-human navigation dialogs, comprising over 7K navigation trajectories punctuated by question-answer exchanges, across 83 Matterport houses [5]. The metrics for R2R can be readily used for the CVDN dataset. Further, one new metric is proposed for the NDH task:

  • Goal Progress measures the difference between the completed distance and the remaining distance to the goal. Larger values indicate a more efficient agent.

Three settings are considered, depending on which ground-truth action/path is employed [30]. Oracle indicates the shortest path, and Navigator indicates the path taken by the navigator. The Mixed supervision path means taking the navigator path if available, and otherwise the shortest path. The results are in Table 2. The proposed Prevalent significantly outperforms the Seq2Seq baseline on both validation and test unseen environments in all settings, leading to the top position on the leaderboard⁵. Note that our encoder is pre-trained on the R2R dataset. We observe that it provides significant improvement when used in the new task built on the CVDN dataset. This shows that the pre-trained model can adapt well to new tasks, and yield better generalization.

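| Agent | Test Seen SR | Test Seen SPL | Test Seen NE | Test Seen #R | Test Unseen SR | Test Unseen SPL | Test Unseen NE | Test Unseen #R |
|---|---|---|---|---|---|---|---|---|
| Rule: Random Walk | 0.54 | 0.33 | 15.38 | 0.0 | 0.46 | 0.23 | 15.34 | 0.0 |
| Rule: Forward 10 | 5.98 | 4.19 | 14.61 | 0.0 | 6.36 | 4.78 | 13.81 | 0.0 |
| No assistance | 17.21 | 13.76 | 11.48 | 0.0 | 8.10 | 4.23 | 13.22 | 0.0 |
| Anna | 88.37 | 63.92 | 1.33 | 2.9 | 47.45 | 25.50 | 7.67 | 5.8 |
| Prevalent (Ours) | 83.82 | 59.38 | 1.47 | 3.4 | 52.91 | 28.72 | 5.29 | 6.6 |
| Skyline: Shortest | 100.00 | 100.00 | 0.00 | 0.0 | 100.00 | 100.00 | 0.00 | 0.0 |
| Skyline: Perfect assistance | 90.99 | 68.87 | 0.91 | 2.5 | 83.56 | 56.88 | 1.83 | 3.2 |

Table 3: Results on test splits of HANNA. SR and SPL are in percent, NE is in meters, and #R is the number of help requests. The agent with “perfect assistance” uses the teacher navigation policy to make decisions when executing a subtask from the assistant. Blue indicates the best value.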

6.4 HANNA

Dataset & Evaluation Metric

The HANNA dataset features 289 object types; the language instruction vocabulary contains 2,332 words. The numbers of locations on the shortest paths to the requested objects are restricted to be between 5 and 15. With an average edge length of 2.25 meters, the agent has to travel about 9 to 32 meters to reach its goals. Similar to R2R, SR, SPL and NE are used to evaluate the navigation. Further, one new metric is considered for this interactive task:

  • Number of Requests (#R) measures how many times the agent requests help.

The results are shown in Table 3. Two rule-based methods and two skyline methods are reported as references, see [22] for details. Our Prevalent outperforms the baseline agent Anna on the test unseen environments in terms of SR, SPL and NE, while requesting a slightly higher number of helps (#R). When measuring the performance gap between seen and unseen environments, we see that Prevalent shows a significantly smaller difference than Anna, \eg, (59.38-28.72=30.66) vs (63.92-25.50=38.42) for SPL. This means that the pre-trained joint representation by Prevalent can reduce over-fitting, and generalise better to unseen environments.

6.5 Ablation Studies

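| Methods | Navigation QA: Oracle | Navigation QA: Navigator | Navigation QA: Mixed | Oracle Answer: Oracle | Oracle Answer: Navigator | Oracle Answer: Mixed | All: Oracle | All: Navigator | All: Mixed |
|---|---|---|---|---|---|---|---|---|---|
| $\mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{AP}}$ | 2.80 | 3.01 | 3.28 | 2.78 | 3.44 | 3.38 | 2.58 | 2.99 | 3.15 |
| $\mathcal{L}_{\mathrm{MLM}}$ | 2.69 | 3.00 | 3.25 | 2.84 | 3.35 | 3.19 | 2.52 | 2.98 | 3.14 |
| BERT pre-training | 2.26 | 2.71 | 2.94 | 2.70 | 2.68 | 3.06 | 2.46 | 2.74 | 2.64 |
| BERT fine-tuning | 2.39 | 2.03 | 2.51 | 2.23 | 2.41 | 2.52 | 2.32 | 2.93 | 2.28 |

Table 4: Ablation study of the pre-training objectives on CVDN measured by Goal Progress. Blue indicates the best value.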
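
| Method | Val Seen TL | Val Seen NE | Val Seen SR | Val Seen SPL | Val Unseen TL | Val Unseen NE | Val Unseen SR | Val Unseen SPL | Test Unseen TL | Test Unseen NE | Test Unseen SR | Test Unseen SPL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Two-stage | 10.32 | 3.67 | 0.69 | 0.66 | 10.19 | 4.71 | 0.58 | 0.53 | 10.51 | 5.30 | 0.54 | 0.51 |
| Feature-based | 10.13 | 3.98 | 0.66 | 0.64 | 9.70 | 5.01 | 0.54 | 0.51 | 9.99 | 5.54 | 0.52 | 0.49 |

Table 5: Ablation study on R2R: feature-based vs fine-tuning. Blue indicates the better value.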

Is pre-training with actions helpful?

Our pre-training objective (Section 4.3) includes two losses, $\mathcal{L}_{\mathrm{MLM}}$ and $\mathcal{L}_{\mathrm{AP}}$. To study the impact of each loss, we pre-train two model variants: one is based on the full objective $\mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{AP}}$, the other only uses $\mathcal{L}_{\mathrm{MLM}}$. To verify the impact on new tasks, we consider CVDN first, and the results are shown in Table 4. Three types of text inputs are considered: Navigation QA, Oracle Answer, and All (a combination of both). More details are provided in the Appendix.

When $\mathcal{L}_{\mathrm{AP}}$ is employed in the objective, we see consistent improvement on nearly all metrics and settings. Note that our MLM is different from BERT in that attention over images is used in the cross-attention layers. To verify whether the image-attended learning is necessary, we consider BERT in two ways. (i) BERT pre-training: we apply the original MLM loss in BERT on our R2R pre-training dataset, and the newly pre-trained BERT is used for fine-tuning on CVDN. (ii) BERT fine-tuning: we directly fine-tune the off-the-shelf BERT on CVDN. Their performances are lower than the two variants of the proposed Prevalent. This means our image-attended MLM is more effective for navigation tasks. More ablation studies on the pre-training objectives are conducted for HANNA, with results shown in the Appendix.

Feature-based vs Fine-tuning

The pre-trained encoder can be used in two modes: (i) the fine-tuning approach, where a task-specific layer is added to the pre-trained model, and all parameters are jointly updated on a downstream task; (ii) the feature-based approach, where fixed features are extracted from the pre-trained model, and only the task-specific layer is updated. In this paper, the Prevalent results presented generally use the feature-based approach, as there are major computational benefits to pre-computing an expensive representation of the training data once, and then running many experiments with cheaper models on top of this representation. On the R2R dataset, we consider a two-stage scheme, where we fine-tune the cross-attention layers of the agent, after training via the feature-based approach. The results are reported in Table 5. We observe notable improvement with this two-stage scheme on nearly all metrics, except the trajectory length.
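The difference between the modes comes down to which parameter groups receive gradient updates. The Python sketch below makes that explicit; the group names (`text_encoder`, `vision_encoder`, `cross_modal_encoder`, `task_layer`) are hypothetical labels, and keeping the task-specific layer trainable during the second stage of the two-stage scheme is an assumption rather than something stated in the text.

```python
ENCODER_GROUPS = ("text_encoder", "vision_encoder", "cross_modal_encoder")
TASK_GROUP = "task_layer"


def trainable_groups(mode):
    """Return the set of parameter groups updated under a given training mode."""
    if mode == "feature_based":
        # Pre-trained features are fixed; only the task-specific layer learns.
        return {TASK_GROUP}
    if mode == "fine_tuning":
        # All pre-trained parameters and the task-specific layer are updated.
        return set(ENCODER_GROUPS) | {TASK_GROUP}
    if mode == "two_stage_second":
        # After feature-based training, the cross-attention layers are unfrozen.
        # Updating the task layer in this stage as well is an assumption.
        return {"cross_modal_encoder", TASK_GROUP}
    raise ValueError(f"unknown mode: {mode}")


def frozen_groups(mode):
    """Parameter groups whose outputs could be pre-computed once and cached."""
    return set(ENCODER_GROUPS) - trainable_groups(mode)


feature_based_frozen = frozen_groups("feature_based")
second_stage_frozen = frozen_groups("two_stage_second")
```

In the feature-based mode every encoder group is frozen, which is what makes it possible to pre-compute the representation of the training data once and reuse it across experiments.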

(a) R2R
(b) CVDN
Figure 4: Learning curves on (a) R2R and (b) CVDN.

How does pre-training help generalization?

We plot the learning curves on the seen/unseen environments for R2R in Figure 4(a), and CVDN in Figure 4(b). Compared with the randomly initialized word embeddings in EnvDrop [28], the pre-trained word embeddings adapt faster (especially in the early stage), and converge to higher performance in unseen environments. This is demonstrated by the SPL values in Figure 4(a). By comparing the learning curves in Figure 4(b), we see a much smaller gap between seen and unseen environments for Prevalent than for the Seq2Seq baseline [30], meaning that pre-training is an effective tool to help reduce over-fitting in learning.

7 Conclusions

We present Prevalent, a new pre-training and fine-tuning paradigm for vision-and-language navigation problems. This allows for more effective use of the limited training data to improve generalization to the previously unseen environments, and new tasks. The pre-trained encoder can be easily plugged into existing models to boost their performance. Empirical results on three benchmarks (R2R, CVDN and HANNA) demonstrate that Prevalent significantly improves over the existing methods, achieving new state-of-the-art performance.

Supplementary Material: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training

Summary of Contributions.

Weituo implemented the algorithm, made the model work, and ran all experiments. Chunyuan initiated the idea of pre-training the first generic agent for VLN, led and completed the manuscript writing. Xiujun provided the codebase and helped implementation. Lawrence and Jianfeng edited the final manuscript.

Appendix A Experiments

Three types of inputs on CVDN

We illustrate the naming of three types of text inputs on CVDN in Table 6.

Oracle Answer
Navigation QA
Table 6: Three types of inputs on CVDN. is the target object, is the ResNet feature. and are the question and answers in the -th turn. are the question & answer pairs before the -th turn.

Ablation Study Results on HANNA

Table 7 shows the results with different pre-training objectives. We see that the full objective $\mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{AP}}$ yields the best performance among all variants.

Prevalent () 83.82 59.38 1.47 3.4 52.91 28.72 5.29 6.6
Prevalent () 78.75 54.68 1.82 4.3 44.29 24.27 6.33 8.1
BERT (feature-based) 57.54 34.33 4.71 3.9 24.12 11.50 9.55 11.3
BERT (fine-tuning) 80.75 57.46 1.97 4.0 26.36 12.66 9.1 8.3
Table 7: Ablation study of pre-training objectives on test splits of HANNA.

Appendix B Comparison with Related Work

Comparison with Press.

The differences are summarized in Table 8(a). Empirically, we show that (1) incorporating visual and action information into pre-training can improve navigation performance; (2) pre-training can generalize across different new navigation tasks.

Comparison with vision-language pre-training (VLP).

The differences are summarized in Table 8(b). Though the proposed methodology generally follows self-supervised learning approaches such as VLP or BERT, our research scope and problem setups are different, which renders existing pre-trained models not readily applicable.

(a) Press (b) VLP
Table 8: Comparison with related works.


  1. Pre-trained vision-and-language based navigator
  2. Among all the public results at the time of this submission.
  4. The full list of leaderboard is publicly available:
  5. The full list of leaderboard is publicly available:


  1. P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva and A. Zamir (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757.
  2. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
  3. P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In CVPR, Vol. 2.
  4. J. L. Ba, J. R. Kiros and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
  5. A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng and Y. Zhang (2017) Matterport3D: learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV).
  6. H. Chen, A. Suhr, D. Misra, N. Snavely and Y. Artzi (2019) Touchdown: natural language navigation and spatial reasoning in visual street environments. CVPR.
  7. A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh and D. Batra (2018) Embodied question answering. In CVPR.
  8. J. Devlin, M. Chang, K. Lee and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. NAACL.
  9. D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein and T. Darrell (2018) Speaker-follower models for vision-and-language navigation. NIPS.
  10. R. Girshick (2015) Fast R-CNN. In ICCV.
  11. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  12. L. Ke, X. Li, Y. Bisk, A. Holtzman, Z. Gan, J. Liu, J. Gao, Y. Choi and S. Srinivasa (2019) Tactical rewind: self-correction via backtracking in vision-and-language navigation. CVPR.
  13. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  14. E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta and A. Farhadi (2017) AI2-THOR: an interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474.
  15. G. Li, N. Duan, Y. Fang, D. Jiang and M. Zhou (2019) Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066.
  16. X. Li, C. Li, Q. Xia, Y. Bisk, A. Celikyilmaz, J. Gao, N. Smith and Y. Choi (2019) Robust navigation with language pretraining and stochastic sampling. EMNLP.
  17. J. Lu, D. Batra, D. Parikh and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NIPS.
  18. C. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher and C. Xiong (2019) Self-monitoring navigation agent via auxiliary progress estimation. ICLR.
  19. C. Ma, Z. Wu, G. AlRegib, C. Xiong and Z. Kira (2019) The regretful agent: heuristic-aided navigation through progress estimation. CVPR.
  20. D. Misra, J. Langford and Y. Artzi (2017) Mapping instructions and visual observations to actions with reinforcement learning. EMNLP.
  21. W. Monroe, R. X. Hawkins, N. D. Goodman and C. Potts (2017) Colors in context: a pragmatic neural model for grounded language understanding. TACL.
  22. K. Nguyen and H. Daumé III (2019) Help, Anna! Visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. EMNLP.
  23. M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser and V. Koltun (2017) MINOS: multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931.
  24. W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei and J. Dai (2019) VL-BERT: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.
  25. C. Sun, A. Myers, C. Vondrick, K. Murphy and C. Schmid (2019) VideoBERT: a joint model for video and language representation learning. ICCV.
  26. I. Sutskever, O. Vinyals and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS.
  27. H. Tan and M. Bansal (2019) LXMERT: learning cross-modality encoder representations from transformers. EMNLP.
  28. H. Tan, L. Yu and M. Bansal (2019) Learning to navigate unseen environments: back translation with environmental dropout. EMNLP.
  29. J. Thomason, D. Gordon and Y. Bisk (2019) Shifting the baseline: single modality performance on visual navigation & QA. In NAACL.
  30. J. Thomason, M. Murray, M. Cakmak and L. Zettlemoyer (2019) Vision-and-dialog navigation. CoRL.
  31. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. In NIPS.
  32. X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y. Wang, W. Y. Wang and L. Zhang (2019) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. CVPR.
  33. X. Wang, W. Xiong, H. Wang and W. Y. Wang (2018) Look before you leap: bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. ECCV.
  34. L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso and J. Gao (2020) Unified vision-language pre-training for image captioning and VQA. AAAI.