Touchdown: Natural Language Navigation and Spatial Reasoning
in Visual Street Environments
We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a real-life visual urban environment to a goal position, and then identify in the observed image a location described in natural language to find a hidden object. The data contains English instructions and spatial descriptions paired with human demonstrations. We perform qualitative linguistic analysis, and show that the data displays richer use of spatial reasoning compared to related resources. Empirical analysis shows the data presents an open challenge to existing methods.
Consider the visual challenges of following natural language instructions in a busy urban environment. Figure 1 illustrates this problem. The agent must identify objects and their properties to resolve mentions such as "traffic light" and "American flags", identify patterns in how objects are arranged to find the flow of traffic, and reason about how the relative position of objects changes as it moves to go past them. Reasoning about vision and language has been studied extensively with various tasks, including visual question answering [Antol:15vqa, Zitnick:13abstract], visual navigation [Anderson:17, Misra:18goalprediction], interactive question answering [Das:17eqa, gordon2018iqa], and referring expression resolution [Kazemzadeh:14, Mao:16googleref, Matuszek:12]. However, existing work has largely focused on relatively simple visual input, including object-focused photographs [Lin:14coco, Reed:16] or simulated environments [Bisk:16dataset, Das:17eqa, Kolve:17, Misra:18goalprediction, Yan:18chalet]. While this has enabled significant progress in visual understanding, the use of real-world visual input not only increases the difficulty of the vision task, but also drastically changes the kind of language it elicits and requires fundamentally different reasoning.
In this paper, we study the problem of reasoning about vision and natural language using an interactive visual navigation environment based on Google Street View (https://developers.google.com/maps/documentation/streetview/intro). We design the task of first following instructions to reach a goal position, and then resolving a spatial description at the goal by identifying the location in the observed image of Touchdown, a hidden teddy bear. Using this environment and task, we release Touchdown, a dataset for navigation and spatial reasoning with real-life observations (Touchdown is the unofficial mascot of Cornell University).
We design our task to elicit diverse use of spatial reasoning, both for following navigation instructions and for resolving the spatial description to find Touchdown. The navigation portion requires the agent to reason about its position relative to objects and about how these relations change as it moves through the environment. In contrast, understanding the description of Touchdown's location requires the agent to reason about the spatial relations between observed objects. The two tasks also diverge in their learning challenges. While in both cases learning requires relying on indirect supervision to acquire spatial knowledge and language grounding, the training data includes demonstrated actions for navigation and annotated target locations for spatial description resolution. The task can be addressed as a whole, or decomposed into its two portions.
The key data collection challenge is designing a scalable process to obtain natural language data that reflects the richness of the visual input while discouraging overly verbose and unnatural language. In our data collection process, workers write and follow instructions. The writers navigate in the environment and hide Touchdown. Their goal is to make sure the follower can execute the instruction to find Touchdown. The measurable goal allows us to reward effective writers, and discourages overly verbose descriptions.
We collect examples of the complete task; each decomposes into one navigation task and one spatial description resolution (SDR) task. Each example is annotated with a navigation demonstration and the location of Touchdown. Our linguistically-driven analysis shows the data requires significantly more complex reasoning than related corpora: nearly all examples require resolving spatial relations between observable objects and between the agent and its surroundings, and each example contains multiple commands and refers to many distinct entities in its environment.
We empirically study the navigation and SDR tasks independently. For navigation, we focus on the learning challenges and the effectiveness of supervised and exploration-driven learning methods. For SDR, we cast the problem of identifying Touchdown’s location as an image feature reconstruction problem using a language-conditioned variant of the UNet architecture [DBLP:journals/corr/RonnebergerFB15]. This approach significantly outperforms several strong baselines.
We make the environment and data available at https://github.com/clic-lab/touchdown, and discuss licensing and release details in the distribution section.
2 Related Work and Datasets
Jointly reasoning about vision and language has been studied extensively, most commonly focusing on static visual input for reasoning about image captions [Chen:15coco, Lin:14coco, Suhr:17visual-reason, Reed:16] and grounded question answering [Antol:15vqa, goyal2017making, Zitnick:13abstract]. Recently, the problem has been studied in interactive simulated environments where the visual input changes as the agent acts, such as interactive question answering [Das:17eqa, gordon2018iqa] and instruction following [Misra:18goalprediction, Misra:17instructions]. In contrast, we focus on an interactive environment with real-world observations.
The most related resources to ours are R2R [Anderson:17] and Talk the Walk [devries:18]. R2R uses panorama graphs of house environments for the task of navigation instruction following. R2R is composed of many separate house environments, each containing a relatively small number of panoramas, significantly smaller than our single large environment. The larger environment requires following the instructions closely, as finding the goal using search strategies is unlikely, even given a large number of steps. We also observe that the language in our data is significantly more complex than in R2R (see the data analysis section). Our environment setup is related to Talk the Walk, which uses panoramas in small urban environments for a navigation dialogue task. In contrast to our setup, the instructor does not observe the panoramas directly, but instead sees a simplified diagram of the environment with a small set of pre-selected landmarks. As a result, the instructor has less spatial information compared to Touchdown; the focus is instead on agent-to-agent conversational coordination.
SDR is related to the task of referring expression resolution, for example as studied in ReferItGame [Kazemzadeh:14] and Google Refexp [Mao:16googleref]. Referring expressions describe an observed object, mostly requiring disambiguation between the described object and other objects of the same type. In contrast, the goal of SDR is to describe a specific location rather than to discriminate between objects. This leads to more complex language, as illustrated by the comparatively longer sentences of SDR (see the data analysis section). Kitaev and Klein [kitaev2017misty] proposed a task similar to SDR, where given a spatial description and a small set of locations in a fully-observed simulated 3D environment, the system must select the described location from the set. We do not use distractor locations, requiring a system to consider all areas of the image to resolve a spatial description.
3 Environment
We use Google Street View to create a large navigation environment, where each position includes a 360° RGB panorama. The panoramas are connected in a graph structure with undirected edges between neighboring panoramas. Each edge leaves its panorama at a specific heading. For each panorama, we render perspective images for all headings that have edges. Our environment is built from panoramas and connecting edges collected in New York City. Figure 2 illustrates the environment.
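The panorama graph described above can be sketched as a simple adjacency structure. This is an illustrative sketch only; the class and field names are ours, not the released environment's API.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    target: str     # id of the neighboring panorama
    heading: float  # heading angle (degrees) at which this edge leaves the panorama

@dataclass
class Panorama:
    pano_id: str
    edges: list = field(default_factory=list)  # outgoing edges, one per neighbor

class StreetGraph:
    """Undirected panorama graph: each edge is stored in both directions,
    each direction with the heading as seen from its source panorama."""

    def __init__(self):
        self.nodes = {}

    def add_pano(self, pano_id):
        self.nodes.setdefault(pano_id, Panorama(pano_id))

    def add_edge(self, a, b, heading_ab, heading_ba):
        # heading_ab: heading of b as seen from a; heading_ba: the reverse
        self.add_pano(a)
        self.add_pano(b)
        self.nodes[a].edges.append(Edge(b, heading_ab))
        self.nodes[b].edges.append(Edge(a, heading_ba))
```

Perspective images would then be rendered once per outgoing edge heading of each panorama.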
4 Task Descriptions
We design two tasks: navigation and spatial description resolution (SDR). Both tasks require recognizing objects and the spatial relations between them. The navigation task focuses on egocentric spatial reasoning, where instructions refer to the agent's relationship with its environment, including the objects it observes. The SDR task displays more allocentric reasoning, where the language requires understanding the relations between the observed objects to identify the target location. The navigation task requires generating a sequence of actions from a small set of possible actions, while the SDR task requires choosing a specific pixel in the observed image. The two tasks present different learning challenges: the navigation task could benefit from reward-based learning, while the SDR task defines a supervised learning problem. The tasks can be addressed separately, or combined by completing the SDR task at the goal position reached at the end of the navigation.
4.1 Navigation
The agent's goal is to follow a natural language instruction and reach a goal position. Let S be the set of all states. A state s = (I, α) is a pair, where I is a panorama and α is the heading angle of the agent. We only allow states where there is an edge connecting I to a neighboring panorama at the heading α. Given a navigation instruction and a start state s_1, the agent performs a sequence of actions. The set of actions is A = {FORWARD, LEFT, RIGHT, STOP}. Given a state s and an action a, the state is deterministically updated using a transition function T : S × A → S. The FORWARD action moves the agent along the edge in its current heading: if the environment includes an edge from I to I' at heading α, the transition is T((I, α), FORWARD) = (I', α'), where the new heading α' is the heading of the edge in I' closest to α. The LEFT (RIGHT) action changes the agent heading to the heading of the closest edge on the left (right): if the panorama I has edges at headings α_1, ..., α_k, LEFT sets the heading to the α_i closest to α in the counterclockwise direction, and RIGHT to the α_i closest in the clockwise direction. Given a start state s_1 and a navigation instruction, an execution is a sequence of state-action pairs ⟨(s_1, a_1), ..., (s_m, a_m)⟩, where a_m = STOP and s_{j+1} = T(s_j, a_j).
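A minimal sketch of the transition function, assuming a hypothetical adjacency map from panorama ids to lists of (neighbor id, heading) edges; headings are in compass degrees, and the tie-breaking details are our own simplification rather than the released environment's implementation:

```python
FORWARD, LEFT, RIGHT, STOP = "forward", "left", "right", "stop"

def angle_diff(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def transition(graph, state, action):
    """One deterministic step of T(s, a). A state is (pano_id, heading);
    `graph` maps pano_id -> list of (neighbor_id, heading) edges."""
    pano, heading = state
    edges = graph[pano]
    if action == STOP:
        return state
    if action == FORWARD:
        # Follow the edge at the current heading, then snap to the
        # edge in the new panorama whose heading is closest to ours.
        target = next(n for n, h in edges if angle_diff(h, heading) < 1e-6)
        new_heading = min((h for _, h in graph[target]),
                          key=lambda h: angle_diff(h, heading))
        return (target, new_heading)
    if action == LEFT:
        # Nearest edge heading counterclockwise from the current heading
        # (a zero offset means "no turn" and is pushed to a full circle).
        return (pano, min((h for _, h in edges),
                          key=lambda h: (heading - h) % 360.0 or 360.0))
    if action == RIGHT:
        # Nearest edge heading clockwise from the current heading.
        return (pano, min((h for _, h in edges),
                          key=lambda h: (h - heading) % 360.0 or 360.0))
    raise ValueError(action)
```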
We use three evaluation metrics: task completion, shortest-path distance, and trajectory edit distance. Task completion (TC) measures the rate of completing the task correctly. We consider an execution correct if the agent reaches the exact goal position or one of its neighboring nodes in the environment graph. Shortest-path distance (SPD) measures the mean graph distance between the agent's final panorama and the goal; this computation ignores turning actions and the agent heading. Trajectory edit distance (ED) measures the mean Levenshtein distance between the sequence of panoramas in the agent's execution and the sequence of panoramas in the annotated demonstration, likewise ignoring turning actions and the agent heading.
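The three navigation metrics can be sketched as follows; the graph encoding (an adjacency map from panorama id to neighbor ids) and function names are our own, not the released evaluation code.

```python
from collections import deque

def task_completion(final_pano, goal_pano, graph):
    """TC: correct if the final panorama is the goal or one of its neighbors."""
    return final_pano == goal_pano or final_pano in graph[goal_pano]

def shortest_path_distance(graph, start, goal):
    """SPD: hop count between panoramas in the graph (BFS), ignoring headings."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return float("inf")  # goal unreachable

def edit_distance(a, b):
    """ED: Levenshtein distance between two panorama-id sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]
```

SPD and ED would be averaged over the test set to produce the reported means.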
4.2 Spatial Description Resolution (SDR)
Given an image I and a natural language description, the task is to identify the point in the image referred to by the description. We instantiate this task as finding the location of Touchdown, a teddy bear, in the environment. Touchdown is hidden and not visible in the input. The image I is a 360° RGB panorama, and the output is a pair of coordinates specifying a location in the image.
We use three evaluation metrics: accuracy, consistency, and distance error. Accuracy is computed with regard to an annotated location. We consider a prediction correct if the coordinates are within a slack radius of the annotation, using Euclidean distance; we measure accuracy for radii of 40, 80, and 120 pixels. Our data collection process results in multiple images for each sentence. We use this to measure consistency over unique sentences, which is computed similarly to accuracy, but with a unique sentence considered correct only if all of its examples are correct. We compute consistency for each slack value. Distance error is the mean Euclidean distance between the annotated location and the predicted location.
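The SDR metrics admit a short sketch. The function names and the grouping of examples by sentence are assumed input formats for illustration, not the released evaluation code.

```python
import math

def sdr_accuracy(pred, gold, slack):
    """Correct if the predicted point is within `slack` pixels (Euclidean)."""
    return math.dist(pred, gold) <= slack

def sdr_consistency(examples, slack):
    """Fraction of unique sentences whose examples are ALL predicted correctly.
    `examples` maps sentence -> list of (pred, gold) coordinate pairs."""
    if not examples:
        return 0.0
    correct = sum(all(sdr_accuracy(p, g, slack) for p, g in pairs)
                  for pairs in examples.values())
    return correct / len(examples)

def mean_distance_error(pairs):
    """Mean Euclidean distance between predicted and annotated locations."""
    return sum(math.dist(p, g) for p, g in pairs) / len(pairs)
```

Accuracy and consistency would each be reported at the 40, 80, and 120 pixel slack values.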