Learning Interpretable Spatial Operations in a Rich 3D Blocks World
In this paper, we study the problem of mapping natural language instructions to complex spatial actions in a 3D blocks world. We first introduce a new dataset that pairs complex 3D spatial operations with rich natural language descriptions that require complex spatial and pragmatic interpretations such as “mirroring”, “twisting”, and “balancing”. This dataset, built on the simulation environment of Bisk, Yuret, and Marcu, contains language that is significantly richer and more complex, while also doubling the size of the original dataset in the 2D environment with 100 new world configurations and 250,000 tokens. In addition, we propose a new neural architecture that achieves competitive results while automatically discovering an inventory of interpretable spatial operations (Figure 10).
One of the longstanding challenges of AI, first introduced as SHRDLU in the early 1970s, is to build an agent that can follow natural language instructions in a physical environment. The ultimate goal is to create systems that can interact in the real world using rich natural language. However, due to the complex interdisciplinary nature of the challenge, which spans several fields in AI, including robotics, language, and vision, most existing studies make varying degrees of simplifying assumptions.
On one end of the spectrum is rich robotics paired with simple constrained language, as acquiring a large corpus of natural language grounded with a real robot is prohibitively expensive. On the other end of the spectrum are approaches based on simulation environments, which support broader deployment at the cost of unrealistic simplifying assumptions about the world. In this paper, we seek to reduce the gap between these two complementary research efforts by introducing a new level of complexity to both the environment and the language associated with the interactions.
Lifting Grid Assumptions We find that language situated in a richer world leads to richer language. One such example is presented in Figure ?. To correctly place the UPS block, the system must understand the complex physical, spatial, and pragmatic meaning of language including: (1) the 3D concept of a tower, (2) that new or fourth are referencing an assumed future, and (3) that mirror implies an axis and reflection. Concepts such as these are often outside the scope of most existing language grounding systems.
In this work, we introduce a new dataset that allows for learning significantly richer and more complex spatial language than previously explored. Building on the simulator provided by Bisk, Yuret, and Marcu, we create roughly 13,000 new crowdsourced instructions (9 per action), nearly doubling the size of the original dataset in the 2D blocks world introduced in their previous work. We address the challenge of realism in the simulated data by introducing three crucial but previously absent complexities:
3D block structures (lifting 2D assumptions)
Fine-grained real valued locations (lifting grid assumptions)
Rotational, angled movements (lifting grid assumptions)
Learning Interpretable Operators In addition, we introduce an interpretable neural model for learning spatial operations in the rich 3D blocks world. Instead of using a single layer conditioned on the language to interpret the operations, our model chooses which parameters to apply via a softmax over a set of candidate parameter vectors. By deciding for each example which parameters to use, the model effectively picks among 32 different networks, selecting the one appropriate for a given sentence. Learning these networks and when to apply them enables the model to cluster spatial functions. Moreover, by encouraging low entropy in the selector, the model converges to nearly one-hot representations during training. A side effect of this decision is that the final model exposes an API which can be used interactively for focusing the model’s attention and choosing its actions. We exploit this property when generating the plots in Figure 10, which show the meaning of each learned function. Our model is still fully end-to-end trainable despite choosing its own parameters and composable structure, leading to a modular network structure similar to that of Andreas et al.
The rest of the paper is organized as follows. We first discuss related work, then introduce our new dataset, followed by our new model. We then present empirical evaluations, analysis of the internal representations, and error analysis. We conclude with a discussion of future work.
Advances in robotics, language, and vision are all applicable to this domain. The intersection of robotics and language has seen impressive results in grounding visual attributes, spatial reasoning, and action taking. For example, recent work has shown how these instructions can be combined with exploration on physical robots to follow instructions and learn representations online.
Within computer vision, Visual Question Answering has been widely popular. Unfortunately, it is unclear what models are learning and how much they are understanding versus memorizing bias in the training data. Datasets and models have also recently been introduced for visual reasoning and referring expressions.
Finally, within the language community, interest in action understanding follows naturally from research in semantic parsing. Here, the community has traditionally been focused on more complex and naturally occurring text, though this has not always been possible for the navigation domain.
Simultaneously, work within NLP and Robotics returned to the question of action taking and scene understanding in SHRDLU style worlds. The goal with this modern incarnation was to truly solicit natural language from humans without limiting their vocabulary or referents. This was an important step in moving towards unconstrained language understanding.
The largest corpus was provided by Bisk, Yuret, and Marcu. In this work, the authors presented pairs of scenes with simulated blocks to users of Amazon’s Mechanical Turk. Turkers would then describe actions or instructions that their imagined collaborator needs to perform to transform the input scene into the target (e.g. moving a block to the side of another). An important aspect of this dataset is that participants assume they are speaking to another human. This means they do not limit their vocabulary or space of references, simplify their grammar, or even write carefully. The annotators assume that whoever will be reading what they submit is capable of error correction, spatial reasoning, and complex language understanding. This provides an important, and realistic, basis for training artificial language understanding agents. Follow-up work has investigated advances to language representations, spatial reasoning, and reinforcement learning approaches for mapping language to latent action sequences.
3 Creating Realistic Data
To facilitate closing the gap between simulation and reality, blocks should not have perfect locations, orderings, or alignments. They should have jitter, nuanced alignments, rotations and the haphazard construction of real objects. Figure ? shows an example of how our new configurations aim to capture that realism (right) as compared to previous work (left). Previous work created target configurations by downsampling MNIST digits. This enabled them to create interpretable but unrealistic 2D final representations, and the order in which blocks were combined was determined by a heuristic to simulate drawing/writing.
In our data, we solicited creations from people around our lab and their children, not affiliated with the project. They built whatever they wanted (open concept domain), in three dimensions, and were allowed to rotate the blocks. For example, the animal on the left is an elephant whose trunk, tail, and legs curve. Additionally, because humans built the configurations, we were able to capture the order in which blocks were placed for a more natural trajectory. Realism brings with it important new challenges discussed below.
Real Valued Coordinate Spaces The discretized world seen in several recent datasets simplifies spatial reasoning. Simple constructions like right can be reduced to exact offsets that do not require context-specific interpretations (e.g. right). In reality, these concepts depend on the scene around them. For example, in the rightmost image of Figure ?, it is correct to say that the McDonald’s block is right of Adidas, but also that SRI is right of Heineken, despite the two having different offsets. The modifier mirroring disambiguates the meaning for us.
Semantically Irrelevant Noise It is important to note that with realism comes noise. Occasionally, an annotator may bump a block or shift the scene a little. Despite repeated efforts to clean and curate the data, most people did not consider this noise noteworthy because it was semantically irrelevant to the task. For example, if while performing an action, a nearby block jostles, it does not change the holistic understanding of the scene. For this reason, we only evaluate the placement of the block that “moved the furthest.” This is a baby step towards building models invariant to changes in the scene orthogonal to the goal.
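The rule of scoring only the block that “moved the furthest” can be illustrated with a minimal sketch. The dictionary layout (block name mapped to an `(x, y, z)` position) and the function name `moved_furthest` are assumptions made for this example and are not taken from the released data format.

```python
import math

def moved_furthest(before, after):
    """Return the id of the block with the largest displacement between two scenes.

    `before` and `after` map a block id to an (x, y, z) position; this layout
    is an assumption for the sketch, not the released data format.
    """
    def displacement(block_id):
        return math.dist(before[block_id], after[block_id])
    return max(before, key=displacement)

# Hypothetical scene: "sri" is jostled slightly, "ups" is the block actually moved.
before = {"adidas": (0.0, 0.0, 0.0), "sri": (1.0, 0.0, 0.0), "ups": (3.0, 0.0, 2.0)}
after = {"adidas": (0.0, 0.0, 0.0), "sri": (1.02, 0.0, 0.01), "ups": (0.5, 1.0, 0.0)}

target = moved_furthest(before, after)
```

With this selection, the small jostle of a nearby block is ignored and only the intended move is evaluated.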
Physics One concession we were forced to make was relaxing physics. Unlike prior work, we insisted that the final configurations roughly adhere to physics (e.g. minimizing overhangs, no floating blocks, limited intersection), but we found volunteers too often gave up if we forced them to build entirely with physics turned on. This also means that intermediary steps that in the real world require a counter-weight can be constructed one step at a time.
Language Our new corpus contains nearly all of the concepts of previous work, but introduces many more. Figure ? shows the most common relations in prior work, and the most common new concepts. We see that these predominantly focus on rotation (degrees, clockwise, ...) and 3D construction (arch, balance, ...), but higher level concepts like mirroring or balancing pose fundamentally new challenges.
Our new dataset comprises 100 configurations split 70-20-10 between training, testing, and development. Each configuration has between five and twenty steps (and blocks). We present type and token statistics in Table 1, where we use NLTK’s treebank tokenizer. This yields higher token counts than those reported in previous work, owing to different assumptions about punctuation.
Not all of our annotators made use of the full 20 blocks. As such, we have fewer utterances than the original dataset for the same number of goal configurations. Yet, we find that the instructions for completing our tasks are more nuanced and therefore result in slightly longer sentences on average. Finally, we note that while the datasets are similar, there are significant enough differences that one should not simply assume that training on the combined dataset will necessarily yield a “better” model on either one individually. There are important linguistic and spatial reasoning differences between the two that make our proposed data much more difficult. We present all modeling results on both subsets of the data and the full combined dataset.
3.2 Evaluation and Angles
We follow the evaluation setup of prior work and evaluate by reporting the average distance (in block lengths) between where a block should be placed and the model’s prediction. This metric naturally extends to 3D.
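As a rough illustration of this metric, the sketch below averages the Euclidean distance between predicted and gold positions and divides by the side length of a block. The function name, the tuple representation of positions, and the example `block_length` value are assumptions chosen for the example.

```python
import math

def mean_error_in_block_lengths(predicted, gold, block_length):
    """Average Euclidean distance between predictions and gold positions,
    expressed in units of one block's side length."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and the same length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, gold))
    return total / (len(gold) * block_length)

# Hypothetical 3D positions in world units; block_length is an assumed value.
predicted = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
gold = [(0.0, 0.0, 0.3), (1.0, 1.6, 0.8)]

error = mean_error_in_block_lengths(predicted, gold, block_length=0.5)
```

Because the distance is computed over full position tuples, the same function covers both 2D and 3D placements.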
We also devise a metric for evaluating rotations. In our released data,
In the following example, nine instructions (three per annotator) are provided for the proper placement of McDonald’s. We see a diverse set of concepts that include counting, abstract notions like mirror or parallel, geometric concepts like a square or row, and even constraints specified by three different blocks.
Later in the same task, the agent will be asked to rotate a block and place it between the two stacks. We present here just a few excerpts wherein the same action is described in five different ways.
Completing these instructions requires understanding angles, a new set of verbs (rotate, spin, ...), and references to the block’s previous orientation. The final example indicates that a spin is necessary, but assumes the goal of having it balance between the two stacks is sufficient information to choose the right angle.
The world knowledge and concepts necessary to complete this task are well beyond the ability of any systems we are currently aware of or expect to be built in the near future. Our goal is to provide data and an environment which more accurately reflects the complexity of grounding language to actions. Where previous work broadened the community’s understanding of the types of natural language people use by recreating a blocks world with real human annotators, we felt they did not go far enough in really covering the space of actions and therefore language naturally present in even this constrained world.
In addition to our dataset, we propose an end-to-end trainable model that is both competitive in performance and has an easily interpretable internal representation. The model takes in a natural language instruction for block manipulation and a 3D representation of the world as input, and outputs where the chosen block should be moved. The model can be broken down into three primary components:
Language Encoding for Block and Operation prediction
Applying a spatial operation
Predicting a coordinate in space.
Our overall model architecture is shown in Figure 1. By keeping the model modular we can both control the bottlenecks that learning must use for representation and provide ourselves post hoc access to interpretable scene and action representations (explored further in the interpretability section). Without these bottlenecks, the model would be free to represent sentences and operations as arbitrary N-dimensional vectors.
As is common, we use bidirectional LSTMs to encode the input sentence. We use two LSTMs: one for predicting blocks to attend to, one for choosing the operations to apply. Both LSTMs share a vocabulary embedding matrix, but have no other means of communication. We experimented with using a single LSTM as well as conditioning one on the other, but found that both degraded performance.
Once we have produced representations for the arguments and the operations, we pass each through its own feed-forward layer and then apply a softmax, producing a distribution over 20 blocks for the arguments and over 32 operations for the operations.
Argument Softmax The first output of our model is an attention over the block IDs. The input world is represented by a 3D tensor of IDs.
Operation Softmax The second distribution we predict is over functions for spatial relations. Here the model needs to choose how far and in what directions to go from the blocks it has chosen to focus on. Unfortunately, there is no a priori set of such functions as we have specifically chosen not to try and pretrain/bias the model in this capacity, so the model must perform a type of clustering where it simultaneously chooses a weighted sum of functions and trains their values.
As noted previously, for the sake of interpretability, we force the encoding for operations ($o$) to be a latent softmax distribution over 32 logits. The final operation vector $v$ that is passed along to the convolutional model is computed as:

$$v = \sum_{i=1}^{32} \mathrm{softmax}(o)_i \, B_i$$

Here, $B$ is a set of 32 basis vectors. The output vector $v$ is a weighted average across all 32 basis vectors, using $\mathrm{softmax}(o)$ to weight each individual basis. The goal of this formulation is for each of the 32 basis vectors to be independently interpretable: replacing $\mathrm{softmax}(o)$ with a 1-hot vector allows us to see what type of spatial operation each vector represents. The choice of 32 basis vectors was an empirical one. We only experimented with powers of two, but it is quite likely that a better value exists.
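The weighted combination of basis vectors, and the 1-hot probing used for interpretation, can be sketched in a few lines. This is a toy illustration under stated assumptions: the basis here has 4 vectors of dimension 3 rather than 32 learned vectors, and the names `softmax`, `operation_vector`, and `one_hot` are hypothetical helpers, not the authors' code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def operation_vector(weights, basis):
    """Weighted average of basis vectors: sum_i weights[i] * basis[i]."""
    dim = len(basis[0])
    return [sum(w * b[d] for w, b in zip(weights, basis)) for d in range(dim)]

def one_hot(index, size):
    """Vector with all mass on a single operation, used to probe one basis vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Toy basis (assumed values); the model described in the text learns 32 such vectors.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
logits = [0.1, 2.0, -1.0, 0.3]

weights = softmax(logits)
soft = operation_vector(weights, basis)                         # training-time mixture
probe = operation_vector(one_hot(1, len(basis)), basis)         # isolates basis[1]
```

Passing a 1-hot vector returns exactly one basis vector, which is what makes each learned operation inspectable in isolation.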
4.2 Predicting a location
The second half of our pipeline features a convolutional model that combines the encoded operation and argument blocks with the world representation to determine the final location of the block-to-move.
Given the aforementioned argument attention map, our model starts by applying the operation vector at every location of the map, weighted by each location’s attention score. This creates a new world representation, which we then pass through two convolutional layers.
In order to predict the final location for the block-to-move, we apply a final convolutional layer to predict offsets and their respective confidences for each location relative to a coordinate grid (8 values total). The coordinate grid is a constant 3D tensor generated by uniformly sampling points across each coordinate axis to achieve the desired resolution. Given the coordinate grid, the goal of the learned convolutional model is to, at every sampled point, predict offsets for $x$, $y$, $z$, and $\theta$, as well as a confidence for each predicted offset. This formulation was similarly used for keypoint localization by Singh, Hoiem, and Forsyth. Let $g^x_i$ be the $x$ coordinate of the sampled grid point at grid location $i$ and let $\Delta x_i$ and $c^x_i$ be the respective offsets and confidences; then the final predicted $x$ coordinate for the block-to-move is computed as:

$$\hat{x} = \sum_i c^x_i \, (g^x_i + \Delta x_i)$$

Here, confidences are softmax normalized across all grid points. Predictions for $y$ and $z$ are computed similarly. We compute $\theta$ without a coordinate grid such that: $\hat{\theta} = \sum_i c^\theta_i \, \Delta\theta_i$.
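The confidence-weighted prediction described above can be sketched for a single axis. This is a simplified illustration, assuming a flattened 1D list of grid points rather than the full 3D tensor; the helper names `uniform_grid`, `predict_coordinate`, and `predict_rotation` and all numeric values are assumptions for the example.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def uniform_grid(low, high, n):
    """n evenly spaced sample points in [low, high]; requires n >= 2."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]

def predict_coordinate(grid, offsets, confidence_logits):
    """Confidence-weighted sum of (grid point + predicted offset)."""
    weights = softmax(confidence_logits)
    return sum(w * (g + o) for w, g, o in zip(weights, grid, offsets))

def predict_rotation(offsets, confidence_logits):
    """Rotation uses no coordinate grid: the offsets alone are weighted."""
    weights = softmax(confidence_logits)
    return sum(w * o for w, o in zip(weights, offsets))

grid = uniform_grid(-1.0, 1.0, 5)                 # [-1.0, -0.5, 0.0, 0.5, 1.0]

# Every grid point "votes" for x = 0.25, so the confidences do not matter here.
agree = predict_coordinate(grid, [0.25 - g for g in grid], [0.0] * 5)

# One highly confident point (grid value 0.5, offset 0.1) dominates the sum.
peaked = predict_coordinate(grid, [0.0, 0.0, 0.0, 0.1, 0.0], [0.0, 0.0, 0.0, 50.0, 0.0])

theta = predict_rotation([0.0, 0.0, 0.5, 0.0, 0.0], [0.0, 0.0, 50.0, 0.0, 0.0])
```

Because every grid point contributes its own estimate, the final coordinate is real valued even though the grid itself is coarse.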
Our model is trained end-to-end using Adam with a batch size of 32. The convolutional portion of the model has 3 layers and operates on a world representation of dimensions 32 × 4 × 64 × 64 × 32 (batch, depth, height, width, channels). The first convolutional layer uses a filter of size 4 × 5 × 5 and the second of size 4 × 3 × 3, each followed by a tanh nonlinearity for the 3D model.
5 Interpretability and Visualizing the Model
One of the features of our model is its interpretability, which we ensured by placing information bottlenecks within the architecture. By designing the language-to-operation encoding process as predicting a probability distribution over a set of learned basis vectors, we can interpret each vector as a separate operation and visualize the behaviors of each operation vector individually.
Visualizing Operations We generated Figure 10 by placing a single block in the world and moving it around a 9 by 9 grid and passing a 1-hot operation choice vector to our model. We then plot a vector from the block’s center to the predicted target location. We see many simple and expected relationships (left, right, ...), but importantly we see the operations are location specific functions, not simply offsets. Operations on the edges of the world are more fine-grained and many move directly to a region of the world (e.g. 9 = “center”), not simply an offset. It is also possible that some of the more dramatic edge vectors may serve as a failsafe mechanism for misinterpreted instructions. In particular, nearly all of the operations when applied in the bottom right corner redirect to the center of the board rather than off of it.
Additionally, while shown here in 2D, all of our predictions are actually in 3D and contain rotation predictions. In Figure 10 the operations denoting directly on top are the figures with the shortest arrows (e.g. Operation 14).
Interpolating Operations The 1-hot operations can be treated like API calls where several can be called at the same time and interpolated. Figure 2 shows the predicted offsets when interpolating operations 23 (north) and 26 (east). There are two important takeaways from this. First, we see that when combined, we can sweep out angles in the first quadrant to reference them all. Second, we see that magnitudes and angles change as we move closer to the edges of the world. This result is intuitive and desired. Specifically, a location like “to the right” has a variable interpretation depending on how much space exists in the world, and the model is trying to make sure not to push a block off the table. In practice, our analysis found very few clear cases of the model using this power. More commonly, mass would be split between two very similar operations or the sentence was a compound construction (left of X and above Y). We did find that operation 11 correlated with the description between but it is difficult to divine why from the grid. An important extension for future work will be to construct a model which can apply multiple operations to several distinct arguments.
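The idea of calling two operations at once can be illustrated with a deliberately simplified sketch. The trained operations are location-specific functions, so the fixed unit offsets assigned below to operations 23 (north) and 26 (east) are purely an assumption for the example, as are the helper names `interpolate` and `toy_offset`.

```python
import math

NUM_OPERATIONS = 32

def interpolate(op_a, op_b, alpha, size=NUM_OPERATIONS):
    """Operation weights with (1 - alpha) on op_a and alpha on op_b."""
    weights = [0.0] * size
    weights[op_a] += 1.0 - alpha
    weights[op_b] += alpha
    return weights

def toy_offset(weights, offsets):
    """Stand-in for the model: a weighted sum of fixed per-operation 2D offsets."""
    x = sum(w * o[0] for w, o in zip(weights, offsets))
    y = sum(w * o[1] for w, o in zip(weights, offsets))
    return x, y

# Assumed toy offsets: operation 23 points north, operation 26 points east.
offsets = [(0.0, 0.0)] * NUM_OPERATIONS
offsets[23] = (0.0, 1.0)
offsets[26] = (1.0, 0.0)

angles = []
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    x, y = toy_offset(interpolate(23, 26, alpha), offsets)
    angles.append(math.degrees(math.atan2(y, x)))
```

In this toy setting the angles sweep from north to east across the first quadrant; the trained model additionally changes magnitudes and angles near the edges of the world, which a fixed-offset stand-in cannot reproduce.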
Linguistic Paraphrase Using the validation data, we clustered sentences by their predicted operation vectors. To pick out phrases we only look at sentences with very low entropy distributions (highly confident) and we present our findings in Table ?. We see that specifications range from short one word indicators (e.g. below) to nearly complete sentences (on the east side of the nvidia cube so that there is no space in between them). This also touches on the fact that several operations have the same direction but different magnitudes. Specifically, operation 23 means far above, not directly, and we see this in the visualized grid as well.
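The selection of highly confident sentences can be sketched as an entropy filter followed by grouping on the most probable operation. The threshold, the three-operation toy distributions, and most of the example sentences are hypothetical; only the general procedure follows the description above.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def group_confident(sentences, distributions, max_entropy):
    """Group sentences by their most probable operation, keeping only
    those whose operation distribution has entropy <= max_entropy."""
    groups = defaultdict(list)
    for sentence, dist in zip(sentences, distributions):
        if entropy(dist) <= max_entropy:
            best = max(range(len(dist)), key=dist.__getitem__)
            groups[best].append(sentence)
    return dict(groups)

# Toy data with 3 operations instead of 32 (assumed values).
sentences = ["below", "left of X and above Y", "under it", "far above"]
distributions = [
    [0.98, 0.01, 0.01],
    [0.50, 0.45, 0.05],   # mass split between two operations: filtered out
    [0.97, 0.02, 0.01],
    [0.01, 0.01, 0.98],
]

clusters = group_confident(sentences, distributions, max_entropy=0.2)
```

Sentences whose mass is split across operations, such as compound constructions, fall above the threshold and are excluded from the paraphrase clusters.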
| v1 + v2 | 95.9 | 1.0 | 0.50 | 1.10 | 0.51 |
| v1 + v2 v1 | 98.1 | 0.8 | 0.15 | 0.84 | 0.15 |
| v1 + v2 v2 | 93.1 | 1.2 | 0.88 | 1.35 | 0.91 |
In Table 2, we compare our model against existing work, and evaluate on both the original Blocks data (v1) and our new corpus (v2). While our primary goal was the creation of an interpretable model and the introduction of new spatial and linguistic phenomena, it is important to see that our model also performs well. We note three important results:
First, we see that our model outperforms the original model of Bisk 16, and is only slightly weaker than Pišl 17. Our technique does outperform theirs when given the correct source block, so it is possible that we can match their performance with tuning.
Second, our results indicate that the new data (v2) is harder than v1, both in terms of isolating the correct block to move (91 vs 98% accuracy) and average error (1.15 vs 0.80) on the End-to-End setting. Further, a model trained on the union of our corpora improved in source prediction on both the v1 and v2 test sets, but target location performance was either unaffected or slightly deteriorated. This indicates to us that the new dataset is in fact complementary and adds new constructions.
Finally, our model has an average rotation error of 0.058 radians (roughly three degrees). In validation, 46% of predictions require a rotation. 1,374 of 1,491 predictions are within 2 degrees of the correct orientation. The remainder have dramatically larger errors (36 at 30°, 81 at 45°). This means that the model is learning to interpret the scene and utterance correctly in the vast majority of cases.
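The rotation figures above can be checked with a short calculation. The numbers are taken directly from the text; the variable names are only for illustration.

```python
import math

# Average rotation error reported in the text, converted to degrees.
mean_error_rad = 0.058
mean_error_deg = math.degrees(mean_error_rad)      # a little over three degrees

# Counts reported in the text.
total_predictions = 1491
within_two_degrees = 1374
errors_at_30, errors_at_45 = 36, 81

remainder = total_predictions - within_two_degrees
fraction_accurate = within_two_degrees / total_predictions
```

The two groups of large errors account for the entire remainder, and roughly 92% of predictions fall within 2 degrees of the correct orientation.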
Several of our model’s worst performing examples are included in Table ?. The model’s error is presented alongside the goal configuration and misunderstood instruction.
The first example specifies the goal location using an abstract concept (tower) and the offset (equidistant) implies recognition of a larger pattern. The second example specifies the goal location in terms of “the 4 stacks”, again without naming any of them and in 3D. Finally, the third demonstrates a particularly nice phenomenon in human language where a plan is specified, the speaker provides categorizing information to enable its recognition, and then can use this newly defined concept as a referent. No models to our knowledge have the ability to dynamically form new concepts in this manner.
Rotations Despite a strong performance by the model on rotations, there are a number of cases that were completely overlooked. Upon inspection, these appear to be predominantly cases where the rotation is not explicitly mentioned, but instead assumed or implied:
place toyota on top of sri in the same direction .
take toyota and place it on top of sri .
... making part of the inside of the curve of the circle .
The first two should be the focus of immediate future work, as they only require assuming that a new block should inherit the orientation of the existing block below it unless there is a compelling reason (e.g. balance) to rotate it. The third case returns to our larger discussion on understanding geometric shapes and is probably out of scope for most approaches.
This work presents a new model which moves beyond simple spatial offset predictions (+x, +y, +z) to learn functions which can be applied to the scene. We achieve this without losing interpretability. In addition, we introduce a new corpus of 10,000 actions and 250,000 tokens which contains a plethora of new concepts (subtle movements, balance, rotation) to advance research in action understanding.
We thank the anonymous reviewers for their many insightful comments. This work was supported in part by the NSF grant (IIS-1703166), DARPA CwC program through ARO (W911NF-15-1-0543), and gifts by Google and Facebook.
- In principle, we could work over an RGB rendering of the world, but doing so would add layers of vision complexity that do not help address the dominant language understanding problems.
- We did not perform a grid search for parameters, but we did find that the 2D model performed better when a relu and Batch-Normalization were used. Finally, the depth values and kernel were set to 1 when training exclusively in 2D.
Andreas, J., and Klein, D. Alignment-based compositional semantics for instruction following.
Andreas, J.; Rohrbach, M.; Darrell, T.; and Klein, D. Learning to compose neural networks for question answering.
Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C. L.; and Parikh, D. VQA: Visual Question Answering.
Artzi, Y., and Zettlemoyer, L. S. Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions.
Bird, S.; Klein, E.; and Loper, E. Natural Language Processing with Python.
Bisk, Y.; Marcu, D.; and Wong, W. Towards a dataset for human computer communication via grounded language acquisition.
Bisk, Y.; Yuret, D.; and Marcu, D. Natural language communication with robots.
Guadarrama, S.; Riano, L.; Golland, D.; Göhring, D.; Yangqing, J.; Klein, D.; Abbeel, P.; and Darrell, T. Grounding Spatial Relations for Human-Robot Interaction.
Harnad, S. The symbol grounding problem.
Hochreiter, S., and Schmidhuber, J. Long short-term memory.
Ioffe, S., and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
Johnson, J.; Hariharan, B.; van der Maaten, L.; Hoffman, J.; Fei-Fei, L.; Zitnick, C. L.; and Girshick, R. Inferring and executing programs for visual reasoning.
Kazemzadeh, S.; Ordonez, V.; Matten, M.; and Berg, T. L. Referitgame: Referring to objects in photographs of natural scenes.
Kingma, D., and Ba, J. Adam: A method for stochastic optimization.
Kollar, T.; Krishnamurthy, J.; and Strimel, G. Toward Interactive Grounded Language Acquisition.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. Gradient-based learning applied to document recognition.
Li, S.; Scalise, R.; Admoni, H.; Rosenthal, S.; and Srinivasa, S. S. Spatial references and perspective in natural language instructions for collaborative manipulation.
MacMahon, M.; Stankiewicz, B.; and Kuipers, B. Walk the talk: Connecting language, knowledge, and action in route instructions.
Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A. L.; and Murphy, K. Generation and comprehension of unambiguous object descriptions.
Matuszek, C.; Bo, L.; Zettlemoyer, L. S.; and Fox, D. Learning from Unscripted Deictic Gesture and Language for Human-Robot Interactions.
Misra, D. K.; Sung, J.; Lee, K.; and Saxena, A. Tell me dave: Context-sensitive grounding of natural language to manipulation instructions.
Misra, D.; Langford, J.; and Artzi, Y. Mapping instructions and visual observations to actions with reinforcement learning.
Pišl, B., and Mareček, D. Communication with robots using multilayer recurrent networks.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. Nothing else matters: Model-agnostic explanations by identifying prediction invariance.
Roy, D., and Reiter, E. Connecting language to the world.
Roy, D. K. Learning visually grounded words and syntax for a scene description task.
Santoro, A.; Raposo, D.; Barrett, D. G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; and Lillicrap, T. A simple neural network module for relational reasoning.
Schuster, M., and Paliwal, K. K. Bidirectional recurrent neural networks.
Singh, S.; Hoiem, D.; and Forsyth, D. Learning to localize little landmarks.
Steels, L., and Vogt, P. Grounding Adaptive Language Games in Robotic Agents.
Tan, H., and Bansal, M. Source-target inference models for spatial instruction understanding.
Tellex, S.; Kollar, T.; Dickerson, S.; Walter, M. R.; Banerjee, A. G.; Teller, S.; and Roy, N. Understanding natural language commands for robotic navigation and mobile manipulation.
Thomason, J.; Zhang, S.; Mooney, R.; and Stone, P. Learning to interpret natural language commands through human-robot dialog.
Thomason, J.; Sinapov, J.; Svetlik, M.; Stone, P.; and Mooney, R. Learning multi-modal grounded linguistic semantics by playing “I spy”.
Thomason, J.; Padmakumar, A.; Sinapov, J.; Hart, J.; Stone, P.; and Mooney, R. J. Opportunistic active learning for grounding natural language descriptions.
Wang, S. I.; Ginn, S.; Liang, P.; and Manning, C. D. Naturalizing a programming language via interactive learning.
Wang, S. I.; Liang, P.; and Manning, C. D. Learning language games through interaction.
Winograd, T. Procedures as a representation for data in a computer program for understanding natural language.
Yu, H., and Siskind, J. M. Grounded language learning from video described with sentences.