AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, Ali Farhadi
Allen Institute for AI, University of Washington, Stanford University, Carnegie Mellon University

We introduce The House Of inteRactions (THOR), a framework for visual AI research. AI2-THOR consists of near photo-realistic 3D indoor scenes in which AI agents can navigate and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate the building of visually intelligent models and to push research forward in this domain.

1 Introduction

Humans demonstrate levels of visual understanding that go well beyond current formulations of mainstream vision tasks (e.g., object detection, scene recognition, image segmentation). A key element of visual intelligence is the ability to interact with the environment and learn from those interactions. Current state-of-the-art computer vision models are trained on still images or videos, which differs from how humans learn. We introduce AI2-THOR as a step towards human-like learning based on visual input.

AI2-THOR v1.0 consists of 120 near photo-realistic 3D scenes spanning four different categories: kitchen, living room, bedroom, and bathroom. Some example scenes are shown in Figure 1.

There are several key factors that distinguish AI2-THOR from other simulated environments used in computer vision:

  1. The most important factor is that AI2-THOR includes actionable objects. By actionable, we mean that the state of the objects can be changed. For example, a microwave can be opened or closed, a loaf of bread can be sliced, or a faucet can be turned on. A few examples of interaction are shown in Figure 2.

  2. The scenes in AI2-THOR are near photo-realistic. This allows better transfer of the learned models to the real world. For instance, ATARI games or board games such as Go, which are typically used to demonstrate the performance of AI models, are very different from the real world and lack much of the visual complexity of natural environments.

  3. The scenes are designed manually by 3D artists. Hence, AI2-THOR does not exhibit the biases found in automatically generated scenes.

  4. AI2-THOR provides a Python API to interact with the Unity 3D game engine, offering many different functionalities such as navigation, applying forces, object interaction, and physics modeling.
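
The pattern of driving such a Python API can be sketched as follows. This is a minimal mock, not the real ai2thor package: the Controller and Event classes below and the 0.25 m step size are illustrative stand-ins for how a scene reset and a discrete action call might look.

```python
# Hypothetical stand-in for a THOR-style Python controller, illustrating
# the typical call pattern: create a controller, reset to a scene, then
# issue discrete actions via step() and inspect the returned event.
class Event:
    """Result of one action: a frame placeholder plus object metadata."""
    def __init__(self, frame, metadata):
        self.frame = frame
        self.metadata = metadata

class Controller:
    """Minimal mock of a THOR-style controller (not the real API)."""
    def __init__(self):
        self.scene = None
        self.agent_position = [0.0, 0.0, 0.0]

    def reset(self, scene_name):
        self.scene = scene_name
        self.agent_position = [0.0, 0.0, 0.0]

    def step(self, action):
        # A real controller would forward the action to the Unity engine;
        # here we only update a toy agent position for 'MoveAhead'.
        if action.get("action") == "MoveAhead":
            self.agent_position[2] += 0.25  # step size is illustrative
        metadata = {"scene": self.scene,
                    "agentPosition": list(self.agent_position)}
        return Event(frame=None, metadata=metadata)

controller = Controller()
controller.reset("FloorPlan28")
event = controller.step({"action": "MoveAhead"})
print(event.metadata["agentPosition"])  # agent has moved forward along z
```

In the real framework the event would also carry the rendered frame; here it is left as a placeholder since no engine is attached.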

Figure 1: Examples of AI2-THOR scenes.

Real robot experiments are typically performed in lab settings or constrained scenes since deploying robots in various indoor and outdoor scenes is not scalable. This makes training models that generalize to various situations difficult. Additionally, due to mechanical constraints of robot actuators, using learning algorithms that require thousands of iterations is infeasible. Furthermore, training real robots can be costly or unsafe, as they might damage the surrounding environment or the robots themselves during training. AI2-THOR provides a scalable, fast, and cheap proxy for real-world experiments in different types of scenarios.

In the following sections, we explain concepts related to AI2-THOR, describe the overall architecture of the framework, and provide links to tutorials and examples.

Figure 2: Applying forces to objects (first row) or changing their state (second row) are two examples of actions that AI2-THOR enables.

2 Related Platforms

There are a number of simulation platforms for evaluating AI models. A variety of benchmarks are based on games, such as the ATARI Learning Environment [4], ViZDoom [13], TorchCraft [19], ELF [20], and DeepMind Lab [3]. OpenAI Universe [1] also provides a toolkit that combines different existing platforms. The main issue with these platforms is that they are not photo-realistic. Moreover, some of them expose the full environment to the agent, while an agent operating in the real world does not see the entire environment. For example, a robot that operates in an apartment does not see the entire apartment.

UETorch [14], Project Malmo [12], and SceneNet [11] are other examples of non-photo-realistic simulated environments.

SYNTHIA [16], Virtual KITTI [9], TORCS [21], and CARLA [7] provide synthetic data for autonomous driving. In contrast, we focus on indoor environments and include actionable objects in our scenes.

HoME [5], House3D [2] and MINOS [17] (which are based on SUNCG [18] and/or Matterport3D [6]) and SceneNet RGBD [15] also provide synthetic indoor environments. The critical advantage of AI2-THOR is that it includes actionable objects, which allows the agent to interact with the environment and perform tasks. Those frameworks are mainly suitable for navigation due to lack of actionable objects and an interaction API. Furthermore, in contrast to those frameworks, AI2-THOR is integrated with a physics engine, which enables modeling complex physical interactions based on friction, mass, forces, etc. Table 1 summarizes the features of the most relevant frameworks.

Table 1: Comparison with other frameworks. AI2-THOR is compared against OpenAI Universe [1], Malmo [12], DeepMind Lab [3], VizDoom [13], MINOS (Matterport3D) [17], House3D [2], and HoME [5] along six dimensions: 3D, Large-Scale, Customizable, Physics, Photorealistic, and Actionable.

3 Concepts

The main high-level concepts in AI2-THOR are the following:

  • Scene: A scene within AI2-THOR represents a virtual room that an agent can navigate in and interact with.

  • Agent: A capsule-shaped entity with a fixed radius and height that can navigate within scenes and interact with objects. The agent is not permitted to pass through any physical object. If the agent collides with an object during a navigation action, the agent is reset to its previous position.

  • Action: A discrete command for the agent to perform within a scene (e.g., MoveAhead, RotateRight, PickupObject). Actions fail if their preconditions are not met. For example, the ‘Open Microwave’ action will fail if the microwave is not in the vicinity of the agent.

  • Object: An object is a 3D model inside the scene. Objects can be static or movable (pickupable). An example of a static object is a cabinet attached to a wall; an example of a movable object is a statue. Objects are also categorized as interactable vs. non-interactable. Non-interactable objects mainly generate clutter and make the scenes more realistic. An example of an interactable object is a drawer that can be opened or closed, and an example of a non-interactable object is a sticky note on the fridge. Non-interactable objects are not included in the metadata that our server sends. There are 87 categories of interactable objects, and each category contains a number of variations; for example, there are 7 different types of bread. A randomizer can be used to change the locations of objects. The variant of an object selected per scene is deterministic and is based on the scene number.

  • Receptacles: A type of object that can contain other objects. Receptacles have predetermined locations where objects will be placed called pivots. A receptacle may have one or more pivots, but a pivot cannot hold more than one object at the same time. Attempts to put an object in a full receptacle will result in failure. If there are multiple vacant pivots (e.g., in a fridge), the first available one will be selected for the object to reside. Sinks, refrigerators, cabinets, and tabletops are some examples of receptacles.

  • Object Visibility: An object is said to be visible if it is in the camera view and within a threshold distance (default: 1 meter) measured from the agent’s camera to the closest point of the object. This determines whether the agent can interact with the object. Note that ‘object visibility’ is not the same as visibility in the image: we consider an object visible if a fixed-length thick ray emitted from the camera hits the object’s collider.
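
The visibility rule above can be approximated with a simple distance check. The following sketch is a simplified stand-in for illustration only: the actual engine casts a fixed-length thick ray against object colliders rather than computing a Euclidean distance, and the function name and signature here are hypothetical.

```python
import math

VISIBILITY_DISTANCE = 1.0  # default threshold in meters (see above)

def is_visible(camera_pos, closest_object_point, in_camera_view):
    """Simplified visibility test: the object must be in the camera view
    and its closest point must lie within VISIBILITY_DISTANCE of the
    agent's camera."""
    distance = math.dist(camera_pos, closest_object_point)
    return in_camera_view and distance <= VISIBILITY_DISTANCE

# An object 0.8 m away and in view counts as visible (interactable)...
print(is_visible((0.0, 1.5, 0.0), (0.0, 1.5, 0.8), True))   # True
# ...but the same object 1.4 m away does not.
print(is_visible((0.0, 1.5, 0.0), (0.0, 1.5, 1.4), True))   # False
```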

4 Architecture

AI2-THOR is made up of two components: (1) a set of scenes built within the Unity Game engine, (2) a lightweight Python API that interacts with the game engine.

On the Python side there is a Flask service that listens for HTTP requests from the Unity game engine. After an action is executed within the game engine, a screen capture is taken and a JSON metadata object is constructed from the state of all the objects in the scene and POSTed to the Python Flask service. This payload is then used to construct an Event object comprising a numpy array (the screen capture) and metadata (a dictionary containing the current state of every object, including the agent). At this point the game engine waits for a response from the Python service, which it receives when the next controller.step() call is made. Once the response is received within Unity, the requested action is taken and the process repeats. Figure 3 illustrates the overall architecture of AI2-THOR.
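
The assembly of an Event from the POSTed payload might look roughly like the sketch below. The field names (screenHeight, screenWidth, image, metadata) and the raw RGB byte layout are illustrative assumptions, not the exact wire format.

```python
import numpy as np

class Event:
    """Pairs the rendered frame with the scene metadata for one step."""
    def __init__(self, frame, metadata):
        self.frame = frame        # numpy array: the screen capture
        self.metadata = metadata  # dict: state of every object and the agent

def build_event(payload):
    """Assemble an Event from a payload like the one the engine POSTs:
    raw image bytes plus a JSON metadata object (keys are assumptions)."""
    h, w = payload["screenHeight"], payload["screenWidth"]
    frame = np.frombuffer(payload["image"], dtype=np.uint8).reshape(h, w, 3)
    return Event(frame, payload["metadata"])
```

A Flask route receiving the POST would decode the multipart body into this payload dict before calling build_event.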

Figure 3: The overall architecture of AI2-THOR.

5 Benchmarks

AI2-THOR can be run in a single-threaded or multi-threaded fashion, where each Python thread has an independent Unity process and game state. On an Intel(R) Core(TM) i7-4770 CPU @3.40GHz and an NVIDIA Titan X (Maxwell), a typical interaction, including all steps of sending a command and receiving a response, takes 0.07 seconds (13 actions per second). When running 8 simultaneous threads, each thread can perform 7 actions per second. On an Intel(R) Xeon(R) CPU E5-2696 v4 @2.20GHz and a Titan X (Pascal), we achieve 22 actions per second for one thread and 15 actions per second for 8 threads. Changing the image resolution affects these rates.
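
For a sense of the aggregate throughput implied by the Maxwell numbers above:

```python
# Aggregate throughput implied by the reported per-thread rates
# (Titan X Maxwell machine).
single_thread_rate = 13   # actions/sec with 1 thread
rate_per_thread_8 = 7     # actions/sec per thread with 8 threads
aggregate_8 = 8 * rate_per_thread_8
print(aggregate_8)        # 56 actions/sec in total across 8 threads
```

So running 8 threads trades per-thread latency for roughly a 4x increase in total actions per second.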

6 Getting Started

A complete tutorial for AI2-THOR, including an installation guide, examples of calling actions, and a description of the metadata, is available on the project website.

7 Example Usage of AI2-THOR

AI2-THOR has been successfully tested in different applications. We used AI2-THOR for the problem of target-driven navigation [23] using deep reinforcement learning and showed that the trained model can be transferred to real robots with some fine-tuning. We also used AI2-THOR for the problem of visual semantic planning using deep successor representations [22] and showed generalization across different tasks. The authors of [8] use AI2-THOR to train a generative adversarial network for generating occluded regions of objects, and the authors of [10] perform interactive visual question answering in the AI2-THOR environment.

8 Conclusion & Outlook

We introduced the AI2-THOR framework, an open-source interactive 3D platform for AI research. One of the key distinguishing factors of AI2-THOR is its actionable objects, whose states can be changed by agents. This enables agents to perform meaningful tasks in the environment.

We encourage the research community to contribute to AI2-THOR, and we appreciate any reports of bugs and issues. Non-rigid-body physics, multi-agent communication, moving objects, and integration with MTurk would be interesting additions to the framework. We welcome any suggestions about additional functionality for future versions.


  • [1] OpenAI Universe, 2016.
  • [2] Anonymous. Building generalizable agents with a realistic and rich 3d environment. 2017. Under Review.
  • [3] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen. Deepmind lab. arXiv, 2016.
  • [4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. J. of Artificial Intelligence Research, 47:253–279, 2013.
  • [5] S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. Courville. Home: a household multimodal environment. arXiv, 2017.
  • [6] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 3DV, 2017.
  • [7] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In CORL, 2017.
  • [8] K. Ehsani, R. Mottaghi, and A. Farhadi. SeGAN: Segmenting and generating the invisible. arXiv, 2017.
  • [9] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
  • [10] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual question answering in interactive environments. arXiv, 2017.
  • [11] A. Handa, V. Patraucean, S. Stent, and R. Cipolla. Scenenet: An annotated model generator for indoor scene understanding. In ICRA, 2016.
  • [12] M. Johnson, K. Hofmann, T. Hutton, and D. Bignell. The malmo platform for artificial intelligence experimentation. In Intl. Joint Conference on Artificial Intelligence, 2016.
  • [13] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, 2016.
  • [14] A. Lerer, S. Gross, and R. Fergus. Learning physical intuition of block towers by example. In ICML, 2016.
  • [15] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison. Scenenet RGB-D: 5m photorealistic images of synthetic indoor trajectories with ground truth. In ICCV, 2017.
  • [16] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez. The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
  • [17] M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv, 2017.
  • [18] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
  • [19] G. Synnaeve, N. Nardelli, A. Auvolat, S. Chintala, T. Lacroix, Z. Lin, F. Richoux, and N. Usunier. Torchcraft: a library for machine learning research on real-time strategy games. arXiv, 2016.
  • [20] Y. Tian, Q. Gong, W. Shang, Y. Wu, and L. Zitnick. ELF: an extensive, lightweight and flexible research platform for real-time strategy games. arXiv, 2017.
  • [21] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. TORCS, the open racing car simulator, v1.3.5., 2013.
  • [22] Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi. Visual semantic planning using deep successor representations. In ICCV, 2017.
  • [23] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.