Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
Learning to navigate in a visual environment by following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable, and the training data for a new task is often limited. In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large number of image-text-action triplets in a self-supervised manner, the pre-trained model provides generic representations of visual environments and language instructions. It can be easily used as a drop-in for existing VLN frameworks, leading to the proposed agent Prevalent.
Learning to navigate in a photorealistic home environment based on natural language instructions has attracted increasing research interest [23, 14, 7, 3, 6], as it provides insight into core scientific questions about multimodal representations. It also takes a step toward real-world applications, such as personal assistants and in-home robots. Vision-and-language navigation (VLN) presents a challenging reasoning problem for agents, as the multimodal inputs are highly variable, inherently ambiguous, and often under-specified.
Most previous methods build on the sequence-to-sequence architecture, where the instruction is encoded as a sequence of words and the navigation trajectory is decoded as a sequence of actions, enhanced with attention mechanisms [3, 32, 18] and beam search. While a number of methods [20, 21, 33] have been proposed to improve language understanding, common to all existing work is that the agent learns to understand each instruction from scratch and in isolation, without collectively leveraging prior vision-grounded domain knowledge.
In practice, however, each instruction only loosely aligns with the desired navigation path, which makes learning to understand instructions from scratch an imperfect paradigm: every instruction only partially characterizes its trajectory, and interpreting an instruction without grounding it in the visual states can be ambiguous. The objects in visual states and language instructions share various common forms and relationships, so it is natural to build an informative joint representation beforehand, and to use this “common knowledge” for transfer learning in downstream tasks.
To address this natural ambiguity of instructions more effectively, we propose to pre-train an encoder that aligns language instructions and visual states to form joint representations. The image-text-action triplets at each time step are independently fed into the model, which is trained to predict the masked word tokens and next actions, thus formulating VLN pre-training in the self-supervised learning paradigm. The complexity of VLN learning can then be reduced by eliminating language interpretations that lack consensus with the visual states. The pre-trained model provides generic image-text representations, and is applicable to most existing approaches to VLN, leading to our agent Prevalent. We consider three VLN scenarios as downstream tasks: Room-to-room (R2R), cooperative vision-and-dialog navigation (CVDN), and “Help, Anna!” (HANNA). The overall pre-training & fine-tuning pipeline is shown in Figure 1.
Comprehensive experiments demonstrate the strong empirical performance of Prevalent, which achieves a new state of the art on all three tasks.
2 Related Work
Vision-Language Pre-training (VLP) is a rapidly growing research area. Existing approaches employ BERT-like objectives to learn cross-modal representations for various vision-language problems, such as visual question answering, image-text retrieval, and image captioning, etc. [25, 27, 17, 34, 24, 15]. However, these VLP works focus on learning representations only for vision-language domains. This paper presents the first pre-trained model that grounds vision-language understanding with actions in a reinforcement learning setting. Further, existing VLP methods require Faster R-CNN features as visual inputs [10, 2], which are not readily applicable to VLN: state-of-the-art VLN systems are based on panoramic views (e.g., 36 images per view for R2R), so it is computationally infeasible to extract region features for all views and feed them into the agent.
Various methods have been proposed for learning to navigate based on vision-language cues. A panoramic action space and a “speaker” model for data augmentation were introduced in prior work, and a novel neural decoding scheme with search was proposed to balance global and local information. To improve the alignment of instructions and visual scenes, a visual-textual co-grounding attention mechanism was proposed, and further improved with a progress monitor. To improve the generalization of the learned policy to unseen environments, reinforcement learning has been considered, including planning, and exploration of unseen environments using an off-policy method. Environment dropout was proposed to generate more environments from the limited data, so that agents generalize well to unseen environments. These methods are specifically designed for particular tasks, and are hard to generalize to new tasks. In this paper, we propose the first generic agent that is pre-trained to effectively understand vision-language inputs for a broad range of navigation tasks, and can quickly adapt to new tasks. The agent most closely related to ours is PreSS. Our work differs in two respects: (i) PreSS employs an off-the-shelf BERT model for language instruction understanding, while we pre-train a vision-language encoder from scratch, specifically for navigation tasks; (ii) PreSS focuses only on the R2R task, while we verify the effectiveness of our pre-trained model on three tasks, including two out-of-domain navigation tasks.
3 Background
The VLN task can be formulated as a Partially Observable Markov Decision Process (POMDP) $\langle \mathcal{S}, \mathcal{A}, P, r \rangle$, where $\mathcal{S}$ is the visual state space, $\mathcal{A}$ is a discrete action space, $P$ is the unknown environment distribution from which we draw the next state, and $r$ is the reward function. At each time step $t$, the agent first observes an RGB image $s_t$, and then takes an action $a_t$. This leads the simulator to generate a new image observation $s_{t+1}$ as the next state. The agent interacts with the environment sequentially, and generates a trajectory $\tau$ of length $T$. The episode ends when the agent selects the special stop action, or when a pre-defined maximum trajectory length is reached. The navigation is successfully completed if the trajectory terminates at the intended target location.
In a typical VLN setting, the instructions are represented as a set $\mathcal{X} = \{\bm{x}_i\}_{i=1}^{K}$, where $K$ is the number of alternative instructions, and each instruction is a sequence of word tokens $\bm{x} = [x_1, \cdots, x_L]$. The training dataset $\mathcal{D}$ consists of instruction-trajectory pairs $(\bm{x}, \tau)$, where $\tau$ is the expert trajectory corresponding to the instruction $\bm{x}$. The agent then learns to navigate by performing maximum likelihood estimation (MLE) of the policy $\pi_{\theta}$, based on the individual sequences:
where $\theta$ are the policy parameters. The policy is usually parameterized as an attention-based Seq2Seq model [3, 9], trained in the teacher-forcing fashion, i.e., the ground-truth states are provided at every step in training. This allows reparameterization of the policy as an encoder-decoder architecture, by considering the function decomposition $\pi_{\theta} = f_{\text{dec}} \circ f_{\text{enc}}$:
A vision-language encoder $f_{\text{enc}}$, where a joint representation $\bm{h}_t$ at time step $t$ is learned over the visual state $s_t$ and the language instruction $\bm{x}$.
An action decoder $f_{\text{dec}}$. For each joint representation $\bm{h}_t$, the decoder grounds it via neural attention, and decodes it into an action $a_t$.
Successful navigation largely depends on a precise joint understanding of the natural language instruction $\bm{x}$ and the visual states $s_t$. We isolate the encoder stage, and focus on pre-training a generic vision-language encoder for various navigation tasks.
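To make the decomposition concrete, the following is a minimal sketch of teacher-forcing MLE training with the encoder-decoder factorization. All shapes, pooling choices, and parameter names here are illustrative stand-ins, not the actual Prevalent architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
V, A, D = 100, 6, 16          # vocab size, action space, hidden dim (illustrative)

# Stand-in parameters for f_enc and f_dec.
W_txt = rng.normal(size=(V, D)) * 0.1   # word embeddings
W_img = rng.normal(size=(D, D)) * 0.1   # visual-state projection
W_act = rng.normal(size=(D, A)) * 0.1   # action head

def f_enc(instruction_ids, visual_state):
    """Joint representation h_t over instruction x and visual state s_t."""
    text = W_txt[instruction_ids].mean(axis=0)   # crude mean pooling of tokens
    return np.tanh(text + visual_state @ W_img)

def f_dec(h_t):
    """Action distribution pi(a | h_t) via a softmax head."""
    logits = h_t @ W_act
    p = np.exp(logits - logits.max())
    return p / p.sum()

def mle_loss(instruction_ids, states, expert_actions):
    """Teacher forcing: ground-truth states are provided at every step."""
    nll = 0.0
    for s_t, a_t in zip(states, expert_actions):
        pi = f_dec(f_enc(instruction_ids, s_t))
        nll -= np.log(pi[a_t])
    return nll / len(expert_actions)

x = rng.integers(0, V, size=8)                    # instruction token ids
traj_s = [rng.normal(size=D) for _ in range(5)]   # visual states s_1..s_T
traj_a = [int(rng.integers(0, A)) for _ in range(5)]  # expert actions
loss = mle_loss(x, traj_s, traj_a)
```

In the actual agent, `f_enc` is the pre-trained Transformer encoder described in Section 4, and `f_dec` is an attention-based LSTM decoder; the loss above is what the gradient updates minimize under teacher forcing.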
4 Pre-training Models
Our pre-training model aims to provide joint representations for image-text inputs in VLN.
4.1 Input Embeddings
The input embedding layers convert the inputs (\ie, panoramic views and language instruction) into two sequences of features: image-level visual embeddings and word-level sentence embeddings.
Following prior work, we employ panoramic views as visual inputs to the agent. Each panoramic view consists of 36 images in total (12 angles, and 3 camera poses per angle): $s = \{s_i\}_{i=1}^{36}$. Each image is represented as a 2176-dimensional feature vector $\bm{v}_i$, the concatenation of two vectors: (i) the 2048-dimensional visual feature of image $s_i$ output by a Residual Network (ResNet); (ii) the 128-dimensional orientation feature vector that repeats $[\sin\psi; \cos\psi; \sin\omega; \cos\omega]$ 32 times, where $\psi$ and $\omega$ are the heading and elevation poses, respectively. The embedding for each image is:
$$\bm{e}_i = \text{LN}(W_v \bm{v}_i + \bm{b}_v),$$
where $W_v$ is a weight matrix, and $\bm{b}_v$ is the bias term. Layer normalization (LN) is used on the output of this fully connected (FC) layer. An illustration of the visual embedding is shown in Figure 2(a).
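A sketch of the visual embedding pipeline is below. The feature dimensions (2048 ResNet + 128 orientation = 2176) follow the text; the output hidden size of 768 and the random `W`, `b` are assumptions for illustration.

```python
import numpy as np

def orientation_feature(heading, elevation):
    """128-dim orientation vector: [sin h, cos h, sin e, cos e] tiled 32 times."""
    base = np.array([np.sin(heading), np.cos(heading),
                     np.sin(elevation), np.cos(elevation)])
    return np.tile(base, 32)                      # shape (128,)

def layer_norm(x, eps=1e-6):
    """Normalize to zero mean / unit variance over the feature dimension."""
    return (x - x.mean()) / (x.std() + eps)

def image_embedding(resnet_feat, heading, elevation, W, b):
    """LN(W v + b) over the 2176-dim concatenated feature."""
    v = np.concatenate([resnet_feat, orientation_feature(heading, elevation)])
    assert v.shape == (2176,)
    return layer_norm(W @ v + b)

rng = np.random.default_rng(0)
d_h = 768                                        # hidden size (assumed)
W = rng.normal(size=(d_h, 2176)) * 0.01          # illustrative FC weights
b = np.zeros(d_h)
emb = image_embedding(rng.normal(size=2048), heading=0.5, elevation=-0.2, W=W, b=b)
```

In practice this is applied to each of the 36 images of a panoramic view, yielding a sequence of 36 visual tokens for the encoder.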
The embedding layer for the language instruction follows the standard Transformer, where LN is applied to the summation of the token embedding and position embedding. An illustration of the text embedding is shown in Figure 2(b).
Figure 2: (a) Visual embedding; (b) Text embedding.
4.2 Encoder Architecture
Our backbone network has three principal modules: two single-modal encoders (one for each modality), followed by a cross-modal encoder. All modules are based on a multi-layer Transformer. For the $l$-th Transformer layer, its output is
$$H^{l} = \text{TransformerLayer}(H^{l-1}, A, M),$$
where $H^{l-1} \in \mathbb{R}^{n \times d}$ is the previous layer's feature matrix ($n$ is the sequence length), $A$ is the feature matrix to attend, and $M$ is the mask matrix, which determines whether a pair of tokens can attend to each other. More specifically, in each Transformer block, the output vector is the concatenation of $h$ attention heads ($h$ is the number of heads). One attention head is computed via:
$$\text{head}(H^{l-1}, A, M) = \text{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} + M \right) V,$$
where $H^{l-1}$ and $A$ are linearly projected to a triple of queries, keys and values using parameter matrices $W^{Q}, W^{K}, W^{V}$, respectively; $d_k$ is the projection dimension. In the following, we use different mask matrices $M$ and attended feature matrices $A$ to construct the contextualized representation for each module.
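A single masked attention head can be sketched as follows; the shapes and random weights are illustrative, and masking is expressed by adding $-\infty$ entries to the score matrix before the softmax, as in the standard Transformer.

```python
import numpy as np

def attention_head(H_prev, A, Wq, Wk, Wv, M):
    """softmax(Q K^T / sqrt(d_k) + M) V, with mask entries in {0, -inf}."""
    Q, K, V = H_prev @ Wq, A @ Wk, A @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
H = rng.normal(size=(n, d))
Wq = rng.normal(size=(d, d_k))
Wk = rng.normal(size=(d, d_k))
Wv = rng.normal(size=(d, d_k))

# Self-attention within one modality: full-zero mask, A = H^{l-1}.
out_self = attention_head(H, H, Wq, Wk, Wv, M=np.zeros((n, n)))

# Masking column 3 prevents every position from attending to token 3.
M = np.zeros((n, n))
M[:, 3] = -np.inf
out_masked = attention_head(H, H, Wq, Wk, Wv, M=M)
```

The single-modal and cross-modal encoders below differ only in which matrices are supplied as `A` and `M`.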
The standard self-attention layer is used in the single-modal encoders. All of the keys, values and queries come from the output of the previous layer in the encoder, and each position can attend to all positions belonging to its own modality in the previous layer. Specifically, $M$ is a full-zero matrix, and $A = H^{l-1}$. As in the self-attention encoder module of the standard Transformer, a position-wise feed-forward network (FFN) follows.
To fuse the features from both modalities, the cross-attention layer is considered. The queries come from the previous layer of the other modality, and the memory keys and values come from the output of the current modality. This allows every position in the encoder to attend over all positions in the other modality. It mimics the typical encoder-decoder attention mechanism in the Transformer, but here we consider two different modalities, rather than input-output sequences. The cross-attention layer is followed by a self-attention layer and an FFN layer.
4.3 Pre-training Objectives
We introduce two main tasks to pre-train our model: image-attended masked language modeling (MLM) and action prediction (AP). For an instruction-trajectory pair $(\bm{x}, \tau)$ from the training dataset $\mathcal{D}$, we assume in the pre-training stage that the state-action pairs from the trajectory are independent and identically distributed given the instruction: $(s_t, a_t) \sim p(s, a \,|\, \bm{x})$.
Attended Masked Language Modeling
We randomly mask out input words with a fixed probability, and replace the masked ones with the special token [MASK]. The goal is to predict the masked words based on their surrounding words $\bm{x}_{\backslash m}$ and all images $s$, by minimizing the negative log-likelihood:
$$\mathcal{L}_{\text{MLM}} = -\mathbb{E}_{(\bm{x}, s) \sim \mathcal{D}} \, \log p(x_m \,|\, \bm{x}_{\backslash m}, s),$$
where $x_m$ denotes a masked word.
This is analogous to the cloze task in BERT, where the masked word is recovered from its surrounding words, but here with additional image information to attend to. It helps ground the learned word embeddings in the context of visual states. This is particularly important for VLN tasks, where the agent is required to monitor the progress of the completed instruction by understanding the visual images.
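The masking step can be sketched as below. The masking probability of 0.15 is an assumption borrowed from the BERT convention (the value is not stated in this section), and the tokenization is simplified to whitespace splitting.

```python
import random

MASK, P_MASK = "[MASK]", 0.15     # masking probability assumed, as in BERT

def mask_tokens(tokens, rng):
    """Replace each token with [MASK] with probability P_MASK; return the
    corrupted sequence and a dict of {position: ground-truth token} targets."""
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < P_MASK:
            corrupted.append(MASK)
            targets[i] = tok          # ground truth for the MLM loss
        else:
            corrupted.append(tok)
    return corrupted, targets

rng = random.Random(1)
instr = "walk past the kitchen and stop at the stairs".split()
corrupted, targets = mask_tokens(instr, rng)
```

The corrupted sequence is fed through the cross-modal encoder together with the 36 panoramic image embeddings, and the MLM head is trained to recover the tokens stored in `targets`.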
The output on the special token [CLS] indicates the fused representation of both modalities. We apply an FC layer on top of the encoder output of [CLS] to predict the action. It scores how well the agent can make the correct decision conditioned on the current visual image and the instruction, without referring to the trajectory history. During training, we sample a state-action pair $(s_t, a_t)$ from the trajectory at each step, and apply a cross-entropy loss for optimization:
$$\mathcal{L}_{\text{AP}} = -\mathbb{E}_{(s_t, a_t)} \, \log p(a_t \,|\, s_t, \bm{x}).$$
The full pre-training objective is:
$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{AP}}.$$
Other loss designs can be considered for the pre-training objective. Our initial experiments with masked image modeling did not show better results, so it is excluded from the experiments.
4.4 Pre-training Datasets
We construct our pre-training dataset based on the Matterport3D Simulator, a photo-realistic visual reinforcement learning (RL) simulation environment built on the Matterport3D dataset for the development of intelligent agents. Specifically, the dataset consists of two parts: (i) the training set of R2R, which has 104K image-text-action triplets; (ii) 1,020K instructions synthesized with the Speaker model for the shortest-path trajectories in the training environments, which leads to 6,482K image-text-action triplets. The resulting pre-training dataset size is 6,582K triplets.
5 Adapting to new tasks
We focus on three downstream VLN tasks that are based on the Matterport3D simulator. Each task poses a different challenge for evaluating the agent. The R2R task is used as an in-domain task; it verifies the agent's generalization to unseen environments. CVDN and HANNA are considered out-of-domain tasks, to study the generalization of our agent to new tasks. More specifically, CVDN considers indirect instructions (i.e., dialog history), and HANNA is an interactive RL task.
5.1 Room-to-Room Navigation
In R2R, the goal is to navigate from a starting position to a target position with minimal trajectory length, where the target is explicitly described in a language instruction. To use the pre-trained model for fine-tuning on R2R, the attended contextualized word embeddings are fed into an LSTM encoder-decoder framework, as in [9, 16]. In prior work, the word embeddings are either randomly initialized or taken from a pre-trained BERT; in contrast, our word embeddings are pre-trained from scratch with VLN data and tasks.
5.2 Cooperative Vision-and-Dialog Navigation
In the CVDN environment, the Navigation from Dialog History (NDH) task is defined: an agent searches an environment for a goal location, based on a dialog history consisting of multiple turns of question-and-answer interactions between the agent and its partner. The partner has privileged access to the best next steps the agent should take, according to a shortest-path planner. CVDN is more challenging than R2R, in that the instructions from the dialog history are often ambiguous, under-specified, and indirect with respect to the final target. The fine-tuning model architecture for CVDN is the same as for R2R, except that CVDN usually has much longer text input. We limit the sequence length to 300; words beyond the first 300 in a dialog history are removed.
5.3 HANNA: Interactive Imitation Learning
HANNA simulates a scenario in which a human requester asks an agent, via language, to find an object in an indoor environment, without specifying how to complete the task. The only source of help the agent can leverage in the environment is the assistant, who aids the agent by giving subtasks in the form of a natural language instruction that guides the agent to a specific location, together with an image of the view at that location. When the help mode is triggered, we use our pre-trained model to encode the language instructions, and the resulting features are used by the rest of the system.
6 Experimental Results
6.1 Training details
We pre-train the proposed model on eight V100 GPUs with a batch size of 96 per GPU. The AdamW optimizer is used, and the model is trained for 20 epochs.
Fine-tuning is performed on an NVIDIA 1080Ti GPU. For the R2R task, we follow the same learning schedule as EnvDrop; when training the augmented listener, we use a batch size of 20. We then continue to fine-tune the cross-attention encoder for 20k iterations with a batch size of 10. For the NDH task, we follow the learning schedule of the original work, with a batch size of 15. For HANNA, the training schedule also follows the original work, with a batch size of 32.
6.2 Room-to-Room Navigation
Dataset & Evaluation Metric
The R2R dataset consists of 10,800 panoramic views (each panoramic view has 36 images) and 7,189 trajectories, each paired with three natural language instructions. The dataset has four splits: train, validation seen, validation unseen, and test unseen. The challenge of R2R is to test the agent's generalization ability in unseen environments.
The performance of different agents is evaluated using the following metrics:
Trajectory Length (TL) measures the average length of the navigation trajectory.
Navigation Error (NE) is the mean shortest-path distance in meters between the agent's final location and the target location.
Success Rate (SR) is the percentage of episodes in which the agent's final location is less than 3 meters away from the target location.
Success weighted by Path Length (SPL) trades off SR against TL. A higher score indicates more efficient navigation.
Among these metrics, SPL is the recommended primary metric, and other metrics are considered as auxiliary measures.
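The metrics above can be computed as follows; the episode record format is an illustrative assumption, while the SPL formula follows its standard definition (success weighted by the ratio of shortest-path length to the larger of the taken and shortest path lengths).

```python
def navigation_metrics(episodes, success_radius=3.0):
    """episodes: list of dicts with final distance to goal `nav_error` (m),
    agent path length `path_len` (m), and shortest-path length `shortest_len` (m)."""
    n = len(episodes)
    ne = sum(e["nav_error"] for e in episodes) / n                       # NE
    sr = sum(e["nav_error"] < success_radius for e in episodes) / n      # SR
    # SPL: success indicator weighted by path efficiency, averaged over episodes.
    spl = sum(
        (e["nav_error"] < success_radius)
        * e["shortest_len"] / max(e["path_len"], e["shortest_len"])
        for e in episodes
    ) / n
    return {"NE": ne, "SR": sr, "SPL": spl}

eps = [
    {"nav_error": 1.0, "path_len": 12.0, "shortest_len": 10.0},  # success, inefficient
    {"nav_error": 5.0, "path_len": 8.0,  "shortest_len": 10.0},  # failure
]
m = navigation_metrics(eps)
```

Note how SPL penalizes the first episode for taking a 12 m path when 10 m sufficed, and scores the failed episode as zero regardless of its length.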
[Table 1: results on the Validation Seen, Validation Unseen, and Test Unseen splits.]
We compare our approach with a random baseline and nine recently published systems:
Random: an agent that randomly selects a direction and moves five steps in that direction.
S2S-Anderson: a sequence-to-sequence model using a limited discrete action space.
RPA: an agent that combines model-free and model-based reinforcement learning, using a look-ahead module for planning.
Speaker-Follower: an agent trained with data augmentation from a speaker model on the panoramic action space.
SMNA: an agent trained with a visual-textual co-grounding module and a progress monitor on the panoramic action space.
RCM+SIL: an agent trained with cross-modal grounding locally and globally via RL.
Regretful: an agent with a trained progress monitor heuristic for search that enables backtracking.
Fast: an agent that uses a fusion function to score and compare partial trajectories of different lengths, which enables the agent to efficiently backtrack after a mistake.
EnvDrop: an agent trained with environment dropout, which generates more environments from the limited seen environments.
PreSS: an agent trained with pre-trained language models and stochastic sampling to generalize well in unseen environments.
Comparison with SoTA
Table 1 compares the performance of our agent against the existing published top systems.
In PreSS, multiple instructions are used; for a fair comparison, we follow the same protocol and report Prevalent results accordingly. We see that testing SPL is improved. Further, the gap between seen and unseen environments is smaller for Prevalent than for PreSS, indicating that image-attended language understanding helps the agent generalize better to unseen environments.
[Table 2: goal progress (m) on the Validation Unseen and Test Unseen splits; the Shortest Path Agent skyline achieves 8.36, 7.99, 9.58 and 8.06, 8.48, 9.76 across settings.]
6.3 Cooperative Vision-and-Dialog Navigation
Dataset & Evaluation Metric
The CVDN dataset has 2,050 human-human navigation dialogs, comprising over 7K navigation trajectories punctuated by question-answer exchanges, across 83 Matterport houses. The metrics for R2R can be readily used for the CVDN dataset. Further, one new metric is proposed for the NDH task:
Goal Progress measures the difference between the completed distance and the remaining distance to the goal. Larger values indicate a more efficient agent.
Three settings are considered, depending on which ground-truth action/path is used for supervision: Oracle indicates the shortest path; Navigator indicates the path taken by the navigator; Mixed takes the navigator path if available, and otherwise the shortest path.
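The supervision-path selection described above amounts to a simple rule; the following sketch (function and argument names are illustrative) makes the three settings explicit.

```python
def supervision_path(navigator_path, shortest_path, setting):
    """Select the ground-truth path for supervision under the three NDH settings.

    navigator_path: the path taken by the human navigator, or None if unavailable.
    shortest_path:  the planner's shortest path to the goal.
    """
    if setting == "oracle":
        return shortest_path
    if setting == "navigator":
        return navigator_path
    if setting == "mixed":
        # Prefer the human navigator path; fall back to the shortest path.
        return navigator_path if navigator_path is not None else shortest_path
    raise ValueError(f"unknown setting: {setting}")
```

Each setting yields a different supervision signal for the same dialog, which is why results are reported separately for Oracle, Navigator, and Mixed.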
The results are shown in Table 2. The proposed Prevalent significantly outperforms the Seq2Seq baseline on both validation and test unseen environments in all settings, taking the top position on the leaderboard.
6.4 HANNA: Interactive Imitation Learning
Dataset & Evaluation Metric
The HANNA dataset features 289 object types; the language instruction vocabulary contains 2,332 words. The numbers of locations on the shortest paths to the requested objects are restricted to be between 5 and 15. With an average edge length of 2.25 meters, the agent has to travel about 9 to 32 meters to reach its goals. Similar to R2R, SR, SPL and NE are used to evaluate the navigation. Further, one new metric is considered for this interactive task:
Number of Requests (#R) measures how many times the agent requests help.
The results are shown in Table 3. Two rule-based methods and two skyline methods are reported as references; see the HANNA paper for details. Our Prevalent outperforms the baseline agent Anna on the test unseen environments in terms of SR, SPL and NE, while requesting slightly more help (#R). Measuring the performance gap between seen and unseen environments, Prevalent shows a significantly smaller difference than Anna, e.g., (59.38 - 28.72 = 30.66) vs. (63.92 - 25.50 = 38.42) for SPL. This means that the pre-trained joint representation of Prevalent reduces over-fitting, and generalizes better to unseen environments.
6.5 Ablation Studies
[Table 4: ablation of pre-training objectives on CVDN with three types of text inputs: Navigation QA, Oracle Answer, and All.]
[Table 5: feature-based vs. fine-tuning results on the Validation Seen, Validation Unseen, and Test Unseen splits.]
Is pre-training with actions helpful?
Our pre-training objective in (9) includes two losses, $\mathcal{L}_{\text{MLM}}$ and $\mathcal{L}_{\text{AP}}$. To study the impact of each loss, we pre-train two model variants: one based on the full objective $\mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{AP}}$, the other using only $\mathcal{L}_{\text{MLM}}$. To verify the impact on new tasks, we consider CVDN first; the results are shown in Table 4. Three types of text inputs are considered: Navigation QA, Oracle Answer, and All (a combination of both). More details are provided in the Appendix.
When the action-prediction loss is included in the objective, we see consistent improvement on nearly all metrics and settings. Note that our MLM differs from BERT in that attention over images is used in the cross-modal layers. To verify whether image-attended learning is necessary, we consider BERT in two ways. (i) BERT pre-training: we apply the original MLM loss of BERT on our R2R pre-training dataset; the newly pre-trained BERT is then fine-tuned on CVDN. (ii) BERT fine-tuning: we directly fine-tune the off-the-shelf BERT on CVDN. Both perform worse than the two variants of the proposed Prevalent, which shows that our image-attended MLM is more effective for navigation tasks. More ablation studies on the pre-training objectives are conducted for HANNA, with results shown in the Appendix.
Feature-based vs Fine-tuning
The pre-trained encoder can be used in two modes: (i) the fine-tuning approach, where a task-specific layer is added to the pre-trained model and all parameters are jointly updated on a downstream task; (ii) the feature-based approach, where fixed features are extracted from the pre-trained model and only the task-specific layer is updated. In this paper, all Prevalent results presented so far use the feature-based approach, as there are major computational benefits to pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of this representation. On the R2R dataset, we also consider a two-stage scheme, where we fine-tune the cross-attention layers of the agent after training via the feature-based approach. The results are reported in Table 5. We observe notable improvement with this two-stage scheme on nearly all metrics, except the trajectory length.
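The distinction between the two modes is simply which parameter groups receive gradient updates. A minimal sketch (class and parameter names are illustrative, not the actual implementation):

```python
class PretrainedEncoder:
    """Stand-in for the pre-trained cross-modal encoder."""
    def __init__(self):
        self.frozen = False
        self._params = ["enc_w1", "enc_w2"]   # illustrative parameter names

    def parameters(self):
        # Frozen encoder contributes no trainable parameters.
        return [] if self.frozen else list(self._params)

def trainable_parameters(encoder, task_layer_params, mode):
    """feature-based: only the task-specific layer is updated;
    fine-tuning: encoder and task layer are updated jointly."""
    encoder.frozen = (mode == "feature-based")
    return encoder.parameters() + task_layer_params

enc = PretrainedEncoder()
feat = trainable_parameters(enc, ["task_w"], "feature-based")
fine = trainable_parameters(enc, ["task_w"], "fine-tuning")
```

The two-stage scheme on R2R corresponds to first optimizing with the `"feature-based"` parameter set, then switching the cross-attention layers to the `"fine-tuning"` set for the final 20k iterations.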
How does pre-training help generalization?
We plot the learning curves on the seen/unseen environments for R2R in Figure 4(a), and for CVDN in Figure 4(b). Compared with the randomly initialized word embeddings in EnvDrop, the pre-trained word embeddings adapt faster (especially in the early stage), and converge to higher performance in unseen environments, as demonstrated by the SPL values in Figure 4(a). Comparing the learning curves in Figure 4(b), we see a much smaller gap between seen and unseen environments for Prevalent than for the Seq2Seq baseline, which means that pre-training is an effective tool for reducing over-fitting.
7 Conclusions
We present Prevalent, a new pre-training and fine-tuning paradigm for vision-and-language navigation problems. It allows for more effective use of the limited training data, improving generalization to previously unseen environments and new tasks. The pre-trained encoder can be easily plugged into existing models to boost their performance. Empirical results on three benchmarks (R2R, CVDN and HANNA) demonstrate that Prevalent significantly improves over existing methods, achieving new state-of-the-art performance.
Supplementary Material: Towards Learning a Generic Agent for
Vision-and-Language Navigation via Pre-training
Summary of Contributions.
Weituo implemented the algorithm, made the model work, and ran all experiments. Chunyuan initiated the idea of pre-training the first generic agent for VLN, led and completed the manuscript writing. Xiujun provided the codebase and helped implementation. Lawrence and Jianfeng edited the final manuscript.
Appendix A Experiments
Three types of inputs on CVDN
We illustrate the naming of three types of text inputs on CVDN in Table 6.
Ablation Study Results on HANNA
Table 7 shows the results with different pre-training objectives. We see that the full objective $\mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{AP}}$ yields the best performance among all variants.
Appendix B Comparison with Related Work
Comparison with PreSS.
The differences are summarized in the table below. Empirically, we show that (1) incorporating visual and action information into pre-training improves navigation performance; (2) pre-training can generalize across different new navigation tasks.
Comparison with vision-language pre-training (VLP).
The differences are shown in the table below. Though the proposed methodology generally follows self-supervised learning approaches such as VLP and BERT, our research scope and problem setups are different, which renders existing pre-trained models not readily applicable.
- Pre-trained vision-and-language based navigator
- Among all the public results at the time of this submission.
- The full leaderboard is publicly available: https://evalai.cloudcv.org/web/challenges/challenge-page/97/leaderboard/270
- The full leaderboard is publicly available: https://evalai.cloudcv.org/web/challenges/challenge-page/463/leaderboard/1292
- (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757. Cited by: item SPL.
- (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Cited by: §2.
- (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In CVPR, Vol. 2. Cited by: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, §1, §1, §1, §3, 1st item, 2nd item, §6.2.
- (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.1.
- (2017) Matterport3D: learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV). Cited by: §4.4, §6.3.
- (2019) Touchdown: natural language navigation and spatial reasoning in visual street environments. CVPR. Cited by: §1.
- (2018) Embodied question answering. In CVPR, Cited by: §1.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. NAACL. Cited by: §2, §2.
- (2018) Speaker-follower models for vision-and-language navigation. NIPS. Cited by: §1, §2, §3, §4.1, §4.4, §5.1, 4th item.
- (2015) Fast R-CNN. In CVPR, Cited by: §2.
- (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
- (2019) Tactical rewind: self-correction via backtracking in vision-and-language navigation. CVPR. Cited by: §2, 8th item.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.1.
- (2017) AI2-THOR: an interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474. Cited by: §1.
- (2019) Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066. Cited by: §2.
- (2019) Robust navigation with language pretraining and stochastic sampling. EMNLP. Cited by: §2, §5.1, 10th item, §6.2.
- (2019) VilBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NIPS. Cited by: §2.
- (2019) Self-monitoring navigation agent via auxiliary progress estimation. ICLR. Cited by: §1, §2, 5th item.
- (2019) The regretful agent: heuristic-aided navigation through progress estimation. CVPR. Cited by: §2, 7th item.
- (2017) Mapping instructions and visual observations to actions with reinforcement learning. EMNLP. Cited by: §1.
- (2017) Colors in context: a pragmatic neural model for grounded language understanding. TACL. Cited by: §1.
- (2019) Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. EMNLP. Cited by: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, §1, §6.1, §6.4.
- (2017) MINOS: multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931. Cited by: §1.
- (2019) VL-BERT: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §2.
- (2019) VideoBERT: a joint model for video and language representation learning. ICCV. Cited by: §2.
- (2014) Sequence to sequence learning with neural networks. In NIPS, Cited by: §1.
- (2019) LXMERT: learning cross-modality encoder representations from transformers. EMNLP. Cited by: §2, §4.2.
- (2019) Learning to navigate unseen environments: back translation with environmental dropout. EMNLP. Cited by: §2, 9th item, §6.1, §6.5.
- (2019) Shifting the baseline: single modality performance on visual navigation & qa. In NAACL, Cited by: §3.
- (2019) Vision-and-dialog navigation. CoRL. Cited by: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, §1, §6.1, §6.3, §6.5.
- (2017) Attention is all you need. In NIPS, Cited by: §4.2.
- (2019) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. CVPR. Cited by: §1, §2, 6th item.
- (2018) Look before you leap: bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. ECCV. Cited by: §1, §2, 3rd item.
- (2020) Unified vision-language pre-training for image captioning and VQA. AAAI. Cited by: §2.