Generalization to Novel Objects Using Prior Relational Knowledge


Varun Kumar Vijay
Intel AI Lab
Santa Clara, CA
varun.v.kumar@intel.com
   Abhinav Ganesh
Intel AI Lab
Santa Clara, CA
abhinav.ganesh@intel.com
   Hanlin Tang
Intel AI Lab
Santa Clara, CA
hanlin.tang@intel.com
   Arjun K. Bansal
Intel AI Lab
Santa Clara, CA
arjun.bansal@intel.com
Abstract

To solve tasks in new environments involving objects unseen during training, agents must reason over prior information about those objects and their relations. We introduce the Prior Knowledge Graph network, an architecture for combining prior information, structured as a knowledge graph, with a symbolic parsing of the visual scene, and demonstrate that this approach is able to apply learned relations to novel objects, whereas the baseline algorithms fail. Ablation experiments show that the agents ground the knowledge graph relations to semantically relevant behaviors. In both a Sokoban game and the more complex Pacman environment, our network is also more sample efficient than the baselines, reaching the same performance in 5-10x fewer episodes. Once the agents are trained with our approach, we can manipulate agent behavior by modifying the knowledge graph in semantically meaningful ways. These results suggest that our network provides a framework for agents to reason over structured knowledge graphs while still leveraging gradient-based learning approaches.

1 Introduction

Humans have a remarkable ability to both generalize known actions to novel objects, and reason about novel objects once their relationship to known objects is understood. For example, on being told a novel object (e.g., 'bees') is to be avoided, we readily apply our prior experience avoiding known objects without needing to experience a sting. Deep Reinforcement Learning (RL) has achieved many remarkable successes in recent years, including results on Atari games [1] and Go [2] that have matched or exceeded human performance. While a human playing Atari games can, with a few sentences of natural language instruction, quickly reach a decent level of performance, modern end-to-end deep reinforcement learning methods still require millions of frames of experience (e.g., see Fig. 3 in [3]). Past studies have hypothesized a role for prior knowledge in addressing this gap between human performance and Deep RL [4, 3].

While other works have studied the problem of generalizing tasks involving the same objects (and relations) to novel environments, goals, or dynamics [5, 6, 7, 8, 9, 10], here we specifically study the problem of generalizing known relationships to novel objects. Zero-shot transfer of such relations could provide a powerful mechanism for learning to solve novel tasks. We speculated that objects might serve as a useful intermediate symbolic representation to combine the visual scene with knowledge graphs encoding the objects and their relations [11].

To build this approach, we needed novel components to transfer information between the knowledge graph and the symbolic scene. Prior approaches [12, 13] have only used one-directional transfers, and without diverse edge relation types. In this paper, we propose the Prior Knowledge Graph Network (PKGNet), which makes several key contributions:

  1. We introduce new layer types (Broadcast, Pooling, and KG-Conv) for sharing representations between the knowledge graph and the symbolic visual scene.

  2. We leverage edge-conditioned convolution [14, 15] to induce our method to learn edge-specific relations that can be applied to novel objects.

  3. Compared to several baselines (DQN [1], DQN-Prioritized Replay [16], A2C [17]) in two environments (Sokoban, PacMan), our approach is 5-10x more sample efficient during training, and importantly, able to apply learned relations to novel objects.

  4. We describe a mechanistic role for how the knowledge graph is leveraged in solving the tasks.

We observed agents' behavior while manipulating the knowledge graph at runtime (i.e., using a trained agent), which confirmed that the graph's edges were grounded in the game semantics. These results demonstrate that our PKGNet framework can provide a foundation for faster learning and generalization in deep reinforcement learning.

2 Related Work

2.1 Graph Based Reinforcement Learning

Graph-based architectures for reinforcement learning have been applied in several contexts. In recent work on control problems and text-based games, the graph is used as a structured representation of the state space, either according to the anatomy of the agent [9] or to build a description of relations in the world during text-based exploration [13]. In contrast, we use the graph as structured prior knowledge injected into the network.

In Yang et al. [12], a knowledge graph was used to assist agents in finding novel objects in a visual search task. Our work has several key differences. Here we test methods for applying learned relations (e.g., push, avoid, chase) to novel objects. In their approach, adding those explicit relations to the graph significantly impaired performance, so relations are omitted from their knowledge graph, making their approach unsuitable for our task. Yang et al. also rely on significantly more prior knowledge, including word embeddings and object co-occurrence. Instead, we provide arbitrary one-hot relation vectors, and agents learn to ground those vectors to action, while achieving larger relative performance gains.

Previous approaches are also one-directional, concatenating the knowledge graph features into the state features. We hypothesize that reasoning in both feature space and structured representations is important, and therefore introduce components to share representations between the two domains. By pooling state features into the graph and performing convolution, our model implements a global operation similar to the self-attention layer used in the Relational RL architecture [10]. However, that model tackles the problem of learning relational knowledge during training, without any a priori knowledge. Our model is designed to exploit external knowledge to generalize to new objects at test time.

2.2 Extracting symbols for Reinforcement Learning

Several studies have extracted objects from visual input using unsupervised or semi-supervised methods [18, 19, 20, 21, 22, 23, 24, 25, 26]. As the focus of our study is combining scene graphs and knowledge graphs, and not the extraction of symbols themselves, we assume that our network has object-level ground truth information available from the scene. For this reason we use environments that can be programmatically generated. In comparison to approaches that operate in 3-D environments [12], ours addresses the more fundamental question of whether, and how, prior relational knowledge encoded with minimal additional information can be leveraged efficiently.

3 Prior Knowledge Graph Network

Figure 1: Prior Knowledge Graph network (PKGNet) architecture. The knowledge graph G is first operated on by graph convolution layers (GConv) to enrich the node features [14]. We then use Broadcast to create a compatible scene representation S, indicated here by the cube. The network trunk consists of several KG-Conv layers. The side branch (blue dotted region) allows for reasoning over the knowledge graph structure. See the main text for a more detailed description.

Reinforcement learning models often use RGB features as input to a Convolutional Neural Network (CNN). We apply our algorithms to symbolically parsed visual environments by encoding each symbol into a one-hot vector, as is done for character-level CNN models in natural language processing [27]. In our proposed PKGNet, the input consists of a knowledge graph (representing prior information) and a scene graph.
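As a concrete illustration, the following sketch encodes such a symbolic grid into a one-hot tensor; the symbol vocabulary, grid, and helper function are hypothetical placeholders rather than the paper's actual preprocessing code.

```python
import torch

# Hypothetical symbol vocabulary for a small Sokoban-like grid world.
SYMBOLS = ['+', ' ', 'A', 'a', 'B']          # wall, floor, agent, ball, bucket
SYM_TO_IDX = {s: i for i, s in enumerate(SYMBOLS)}

def encode_scene(rows):
    """Encode a list of equal-length strings into a one-hot tensor of
    shape (num_symbols, height, width)."""
    height, width = len(rows), len(rows[0])
    state = torch.zeros(len(SYMBOLS), height, width)
    for y, row in enumerate(rows):
        for x, ch in enumerate(row):
            state[SYM_TO_IDX[ch], y, x] = 1.0
    return state

scene = ["+++++",
         "+A a+",
         "+  B+",
         "+++++"]
print(encode_scene(scene).shape)  # torch.Size([5, 4, 5])
```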

State. While PKGNet can handle general scene graphs, the environments in this paper are 2-D grid worlds, a specific subset of scene graphs in which the vertices are the symbols and the edges connect neighboring entities. Edge-conditioned graph convolution then reduces to regular 2-D convolution. Therefore, we refer to the scene graph by its state representation S, a spatial feature map over the grid.

Knowledge Graph. The knowledge graph G is a directed graph with a vertex for each symbol in the environment (for subjects and objects), initially encoded as a one-hot vector, together with edge features. The edge features (for relations) are also represented as one-hot vectors. The connectivity of the graph, as well as the edge features, is designed to reflect the structure of the environment. During training, the knowledge graph's structure and features are fixed. Importantly, while we provide the one-hot encoded representation of the edge relationships, the agent must learn to ground the meaning of this representation in terms of rewarding actions during training. If successfully grounded, the agent may use this representation during the test phase, when it encounters novel objects connected by known relationships to entities in the knowledge graph.

Algorithms. We tested our network with the Deep Q-Network (DQN) [1], as well as Prioritized Experience Replay (PER) [16] and the A2C algorithm [17].

3.1 Model Architecture

The model architecture is shown in Figure 1. First, we apply two layers of edge-conditioned graph convolution (ECC) [14] to the knowledge graph G to enrich the node features with information from the neighborhood of each node. Those features are then encoded into the state representation S through a Broadcast layer. The network's main trunk consists of several KG-Conv layers, which jointly convolve over the state and the knowledge graph. The side branch (dotted blue rectangle) enables reasoning over the structured knowledge graph. In the side branch, we first update the knowledge graph with Pooling from the state, followed by graph convolutions. Then, we update the state representation with a KG-Conv layer, which incorporates the updated knowledge graph. Finally, for DQN and DQN-PER, we use a few linear layers to compute the Q-values for each action. For A2C, we also emit a value estimate. We provide more details on the individual components below.

3.2 Model Components

We introduce several operations for transferring information between the state representation S and the knowledge graph G. We can Broadcast the knowledge graph features into the state, or use Pooling to gather the state features into the knowledge graph nodes. We can also update the state representation by jointly convolving over S and G, which we call KG-Conv. Figure 9 in the supplement shows visual depictions of these operations.

Graph Convolutions.

In order to compute features for the entities in the knowledge graph, we use an edge-conditioned graph convolution (ECC) [14]. In this formulation, a multilayer perceptron network is used to generate the filters given the edge features as input. Each graph layer computes the feature of each node as:

h_v^{(l)} = \frac{1}{|N(v)|} \sum_{u \in N(v)} \Theta^{(l)}(e_{u,v}) \, h_u^{(l-1)} + b^{(l)}    (1)

where the weight function \Theta^{(l)} depends only on the edge feature e_{u,v} and is parameterized as a neural network, and N(v) is the set of nodes with edges into node v. Our implementation stacks graph convolution layers in which the weight network is a single linear layer with 8 hidden units. The output is a graph with the same nodes and enriched node features h_v.
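As an illustration of this layer, the sketch below gives a minimal PyTorch edge-conditioned graph convolution in the spirit of Eq. (1); the class name, the two-layer edge network, and the (node features, edge index, edge features) tensor layout are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class EdgeConditionedConv(nn.Module):
    """Sketch of an edge-conditioned graph convolution (Eq. 1).

    A small MLP maps each edge feature to a flattened weight matrix, which
    transforms the source node's features; messages are then averaged over
    each node's in-neighborhood."""

    def __init__(self, in_dim, out_dim, edge_dim, hidden=8):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.edge_net = nn.Sequential(
            nn.Linear(edge_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim * out_dim))
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, node_feats, edge_index, edge_feats):
        # node_feats: (V, in_dim); edge_index: (2, E) rows = (src, dst);
        # edge_feats: (E, edge_dim)
        src, dst = edge_index
        theta = self.edge_net(edge_feats).view(-1, self.out_dim, self.in_dim)
        messages = torch.bmm(theta, node_feats[src].unsqueeze(-1)).squeeze(-1)
        out = torch.zeros(node_feats.size(0), self.out_dim)
        out.index_add_(0, dst, messages)                  # sum messages per destination
        deg = torch.zeros(node_feats.size(0)).index_add_(
            0, dst, torch.ones(dst.size(0))).clamp(min=1)
        return out / deg.unsqueeze(-1) + self.bias        # average + bias

layer = EdgeConditionedConv(in_dim=4, out_dim=8, edge_dim=3)
x = torch.randn(5, 4)                              # 5 nodes
ei = torch.tensor([[0, 1, 2], [1, 2, 0]])          # 3 directed edges (src, dst)
ef = torch.eye(3)                                  # one-hot edge features
print(layer(x, ei, ef).shape)                      # torch.Size([5, 8])
```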

Broadcast.

We define the function Broadcast(G) → S. For each entity v in the knowledge graph, we copy its graph representation h_v to each occurrence of v in the game map. This is used to initialize the state representation S, so that a common embedding refers to entities in both S and G.
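For illustration, a minimal Broadcast sketch is given below; representing the scene by a grid of knowledge-graph node indices (entity_map) is our assumption.

```python
import torch

def broadcast(node_feats, entity_map):
    """Copy each entity's knowledge-graph embedding to every grid cell where
    that entity appears.

    node_feats: (V, d) node embeddings; entity_map: (H, W) long tensor of node
    indices, with -1 marking symbols that are not in the knowledge graph."""
    d = node_feats.size(1)
    state = torch.zeros(d, *entity_map.shape)
    mask = entity_map >= 0
    state[:, mask] = node_feats[entity_map[mask]].t()  # absent symbols stay zero
    return state

h = torch.randn(3, 4)                    # 3 graph nodes, 4-dim embeddings
emap = torch.tensor([[0, -1], [2, 1]])   # 2x2 grid of node indices
print(broadcast(h, emap).shape)          # torch.Size([4, 2, 2])
```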

Pooling.

The reverse of Broadcast, this operation is used to update the entity representations in the knowledge graph. In Pooling(S) → G, we update each node's representation by averaging the features in S over all instances of the corresponding entity in the state.
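A matching Pooling sketch under the same assumed entity_map representation is shown below; how the pooled vectors are combined with the existing node features (here they simply replace them after a linear projection) is a simplification on our part.

```python
import torch
import torch.nn as nn

def pool(state, entity_map, num_nodes, proj):
    """Average the state features over all grid cells occupied by each entity,
    then project to the graph's node-feature dimensionality.

    state: (C, H, W); entity_map: (H, W) node indices (-1 = not in the graph);
    proj: nn.Linear mapping C -> node feature size."""
    C = state.size(0)
    sums = torch.zeros(num_nodes, C)
    counts = torch.zeros(num_nodes)
    mask = entity_map >= 0
    idx = entity_map[mask]               # (K,) node index for each occupied cell
    feats = state[:, mask].t()           # (K, C) state vector for each such cell
    sums.index_add_(0, idx, feats)
    counts.index_add_(0, idx, torch.ones(idx.size(0)))
    means = sums / counts.clamp(min=1).unsqueeze(-1)
    return proj(means)                   # (num_nodes, d_graph)

proj = nn.Linear(4, 6)
state = torch.randn(4, 2, 2)
emap = torch.tensor([[0, -1], [2, 1]])
print(pool(state, emap, num_nodes=3, proj=proj).shape)  # torch.Size([3, 6])
```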

KG-Conv.

To update the state representation S, we augment a regular convolution layer with the knowledge graph. In addition to applying convolutional filters to the neighborhood of a location, we also add the node representation of the entity at that location, passed through a linear layer W to match the number of filters in the convolution. Formally, we can describe this operation as:

S'_{x,y} = \mathrm{Conv}(S)_{x,y} + W \, h_{v(x,y)}    (2)

where v(x,y) denotes the knowledge-graph node of the entity at location (x, y).

This provides a skip connection allowing deeper layers in the network to more easily make use of the global representations.
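A minimal KG-Conv sketch consistent with Eq. (2) is given below, again under the assumed entity_map representation; treating symbols absent from the knowledge graph as contributing a zero vector follows the Broadcast convention and is our assumption.

```python
import torch
import torch.nn as nn

class KGConv(nn.Module):
    """Sketch of KG-Conv (Eq. 2): a standard 2-D convolution over the state,
    plus a per-cell linear projection of the knowledge-graph embedding of the
    entity located at that cell (a skip connection from the graph)."""

    def __init__(self, in_ch, out_ch, node_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.proj = nn.Linear(node_dim, out_ch, bias=False)

    def forward(self, state, node_feats, entity_map):
        # state: (B, in_ch, H, W); node_feats: (V, node_dim);
        # entity_map: (B, H, W) node index per cell, -1 if not in the graph.
        out = self.conv(state)
        idx = entity_map.clamp(min=0)
        per_cell = self.proj(node_feats[idx])                  # (B, H, W, out_ch)
        per_cell = per_cell * (entity_map >= 0).unsqueeze(-1)  # zero absent symbols
        return out + per_cell.permute(0, 3, 1, 2)

layer = KGConv(in_ch=8, out_ch=16, node_dim=4)
out = layer(torch.randn(1, 8, 5, 5), torch.randn(3, 4),
            torch.randint(-1, 3, (1, 5, 5)))
print(out.shape)  # torch.Size([1, 16, 5, 5])
```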


4 Experiments

Previous environments measured generalization to more difficult levels [28, 6], modified environment dynamics [7], or different solution paths [10]. These environments, however, do not introduce new objects at test time. To quantify the generalization ability of PKGNet to unseen objects, we needed a symbolic game whose difficulty can be incremented in terms of the number of new objects and relationships. Therefore, we use a variation of the Sokoban environment, in which the agent pushes balls into the corresponding buckets, and new ball and bucket objects and their pairings are provided at test time. We also benchmarked our model and the baseline algorithms on Pacman, after extracting a symbolic representation [29].

4.1 Sokoban

A variant of the Sokoban environment is implemented using the pycolab game engine [30]. The set of rewarded ball-bucket pairs varies, and in the test games the agent sees balls or buckets not seen during training. For the variations, see Table 2. We increase the difficulty of the environment through the number of ball-bucket pairs, the complexity of the grouping, and the number of unseen objects. The buckets-repeat variant is the most challenging, with complex relationships in the test environment.


Name Training Pairs Test Pairs
one-one
two-one
five-two
buckets
buckets-repeat
Table 2: Experiment variations for the Sokoban environment. The agent is rewarded for pushing each ball into the correct bucket. For each variation, we list the rewarded ball-bucket pairs in the training and test games; note that the test games only include ball types not seen in the training games. Sets denote rewarded combinations: pushing any ball in the set into its paired bucket is rewarded.

4.2 Pacman

We test the agents on the smallGrid, mediumClassic, and capsuleClassic environments from Pacman. The environments differ in the size of the map as well as the numbers of ghosts, coins, and capsules present. The agent receives +10 points for eating a coin, +200 for eating a ghost, +500 for finishing the coins, -500 for being eaten, and -1 for each move.

4.3 Knowledge graph construction

For both environments, we add all entities to the knowledge graph, with the exception of blank spaces. We then add edges between objects to reflect relationships present in the game structure. Each entity or edge type is assigned a unique one-hot vector; note, however, that two different pairs of entities may share the same edge type if they are connected by a similar relationship. While we attach semantic meaning to these edge categories, their utility is grounded by the model during training. Additional details are in the supplement.
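To make this construction concrete, the sketch below assembles a small graph of this form for a Sokoban-style variant; the entity and relation names are hypothetical placeholders rather than the exact graphs used in our experiments (those are detailed in the supplement).

```python
import torch

# Hypothetical entities and relations for a Sokoban-style one-one variant.
ENTITIES = ['agent', 'wall', 'ball_a', 'bucket_A']
RELATIONS = ['impassable', 'pushes', 'fills']

ent_idx = {e: i for i, e in enumerate(ENTITIES)}
rel_idx = {r: i for i, r in enumerate(RELATIONS)}

# Node features: one unique one-hot vector per entity.
node_feats = torch.eye(len(ENTITIES))

# Directed edges (subject, relation, object) reflecting the game structure.
triples = [('agent', 'impassable', 'wall'),
           ('agent', 'impassable', 'bucket_A'),
           ('agent', 'pushes', 'ball_a'),
           ('ball_a', 'fills', 'bucket_A')]

edge_index = torch.tensor([[ent_idx[s] for s, _, _ in triples],
                           [ent_idx[o] for _, _, o in triples]])
edge_feats = torch.eye(len(RELATIONS))[[rel_idx[r] for _, r, _ in triples]]

print(edge_index.shape, edge_feats.shape)   # torch.Size([2, 4]) torch.Size([4, 3])
```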

Figure 4: Sokoban results. For the environments described in Table 2 (columns), performance of the baseline DQN (green), our proposed PKG-DQN (blue), and a variant of PKG-DQN with edges removed (orange) over the number of training episodes. Success rate (fraction of environments completed within 100 steps) is shown for the training (top row) and test (bottom row) environments. Bold lines are the average over runs, and the shaded area denotes the standard error; a moving average over episodes was applied.

Figure 5: Algorithms and Ablations. (A) Our approach with the graph (blue) is more sample efficient in the training environment (first row) compared to the baselines (green) across all tested algorithms (DQN, DQN-PER, and A2C), as indicated by the line styles. The graph approach is also required for generalizing to novel objects in the test environment (second row). (B) Performance drops significantly for the graph network when we remove the side branch (gray).

5 Results

Our network can be used in conjunction with a variety of RL algorithms. Here we tested PKG-DQN, PKG-A2C, and PKG-PER against the regular convolutional baselines in the Sokoban and Pacman environments. In addition, we compared the performance of different knowledge graph architectures during training. We also demonstrated the ability to manipulate agent behavior by changing the knowledge graph at test time.

5.1 Sokoban

In the Sokoban environment, the PKG-DQN model was more sample efficient during training than the baseline Conv-DQN, as shown in Figure 4. For example, in the one-one environment, our model required approximately 8x fewer samples to reach the solution in the training environment. In addition, in more complex environments with an increased number of possible objects and ball-bucket pairings, the baseline Conv-DQN required increasingly more samples to solve the task, whereas PKG-DQN solved it in the same number of samples.

We tested zero-shot transfer by placing the trained agents in environments with objects unseen during training. The PKG-DQN is able to leverage the knowledge graph to generalize and solve the test environments (see Figure 4, bottom row). The baseline DQN failed completely to generalize to these environments.

When we deleted the edges from the PKG-DQN (Figure 4, orange lines), the model trained more slowly and failed to generalize. The No Edge condition still trained faster than the baseline Conv-DQN, possibly due to the additional parameters in the KG-Conv layers; however, that advantage is minimal in our most complex Sokoban environment, buckets-repeat. We also tested baselines with significantly more parameters and different learning rates, without improvement.

These observations held across all baselines tested (DQN, DQN-Prioritized Replay, and A2C), as shown in Figure 5A. The relative performance of the Conv baselines is consistent with previous results (e.g., Ms. Pacman in Table S3 of [17]). We also ran an ablation study in which we removed the side branch from Figure 1 (blue dotted rectangle), which significantly impacted sample efficiency and generalization (Figure 5B). This demonstrates that the structured reasoning, enabled by the bi-directional flow of information through our novel layers, is important for performance.

5.2 Knowledge graph types

To determine whether the results in Figure 4 are sensitive to the choice of knowledge graph architecture, we trained the PKG-DQN model with variants of the base knowledge graph, as shown in Figure 6: 'Base' (the graph cropped to entities present in the scene), graphs with identical ('Same Edges') or no ('No Edges') edge labels, a fully connected graph with either a single shared edge label ('Fully Connected') or distinct edge labels ('Fully Connected - Distinct'), and a 'Complete' graph with no cropping based on presence in the scene.

When we removed the edge distinctiveness ('Same Edges'), the model still trained but failed to generalize to novel objects. When we removed edges entirely ('No Edges'), performance matched that of the baseline DQN. These results show that encoding the game structure into the knowledge graph is important for generalizing to the test environment, but not necessary for the training environments.

Surprisingly, when the knowledge graph is fully connected ('Fully Connected' and 'Fully Connected - Distinct'), the model does not train, suggesting that PKG-DQN cannot recover the prior structure on its own. If the complete graph is available during training, including nodes for objects that only appear in the test environments, the model generalizes to near-optimal performance (see orange lines in 'Complete'). In this condition, even though the object 'c' is not in the training environment, gradients still flow through its edges. To avoid leaking information about ball-bucket pairs seen only at test time into the knowledge graph during training, the base condition crops the knowledge graph to the entities (and corresponding edges) seen in the training environments.

Figure 6: Performance of the PKG-DQN when trained with several knowledge graph variants in our most challenging environment (buckets-repeat). Train performance is shown in blue and test performance in orange. Shaded regions indicate the standard error over repetitions. Results in other environments are similar; see the Supplement.

5.3 Pacman

The PKG-DQN converges significantly faster to a well-performing control policy than the convolution-based DQN on all three Pacman environments (Figure 7). Both models reach similar levels of final performance on the smaller environments, which is expected, as the convolutional model should eventually be able to deduce the relations between the symbols with enough training.

Figure 7: Pacman results. Performance of the baseline Conv-based models (green, purple, orange) and our PKG-DQN (blue) agent on several Pacman environments (smallGrid, mediumClassic, and capsuleClassic). Bold lines are the mean.

5.4 What do the agents learn?

To understand how the agents interpret the edge relations between objects, we observed the behavior of a trained agent running in an environment while manipulating the knowledge graph (Figure 8). For simplicity, consider the one-one environment, with one ball-bucket pair during training and a different pair during testing. When we removed the ball-bucket edge, the agent still pushed the ball but did not know where to push it, suggesting that the agent has learned to ground that edge feature as 'goal' or 'fills'. When we swapped the edge features between the ball and the bucket, the agent attempted to push the bucket into the ball. The knowledge graph can also be manipulated so that the agent pushes a ball into another ball (Supplement). These studies show that the agent learned the 'push' and 'fills' relations and can apply these actions to objects it has never pushed before.

Similarly, in Pacman, if we remove the Player→Scared Ghost edge, the agent no longer chases the scared ghosts (Table 3). Without an edge to the capsule, the agent no longer eats the capsule. The agent can also be manipulated into not avoiding ghosts by changing the Ghost→Player edge feature to the Player→Coin relation. A programmatic sketch of such edits follows Table 3.


Variation Reward Behavior
Base Default behavior
Set Ghost→Player to Player→Coin feature Does not avoid ghosts
Remove Player→Scared Ghost edge Does not chase scared ghosts
Remove Player→Coin edge Pacman moves randomly
Remove Player→Capsule edge Does not eat the capsule
Remove Player→Wall edge Runs into the nearest wall
Remove Scared Ghost→Wall edge Does not chase scared ghosts
Table 3: Manipulating Pacman behavior. Behavior and score of the PKG-DQN agent on the mediumClassic map when various edges are removed or edge features substituted. Reward is shown as mean ± standard error over repetitions.
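Programmatically, these manipulations amount to editing the edge list fed to the trained agent and re-encoding the edge features, with no change to the network weights; the sketch below uses hypothetical Pacman entity and relation names rather than our actual graph definitions.

```python
import torch

RELATIONS = ['impassable', 'pushes', 'fills', 'edible']   # hypothetical labels
rel_onehot = {r: torch.eye(len(RELATIONS))[i] for i, r in enumerate(RELATIONS)}

# Edge list of a trained agent's graph as (subject, relation, object) triples.
edges = [('player', 'impassable', 'wall'),
         ('player', 'edible', 'coin'),
         ('player', 'edible', 'scared_ghost'),
         ('ghost', 'impassable', 'wall')]

def remove_edge(edges, subj, obj):
    """Drop every edge between subj and obj (e.g. Player -> Scared Ghost)."""
    return [(s, r, o) for s, r, o in edges if not (s == subj and o == obj)]

def set_relation(edges, subj, obj, new_rel):
    """Overwrite the relation label on the subj -> obj edge."""
    return [(s, new_rel if (s == subj and o == obj) else r, o)
            for s, r, o in edges]

# At test time the edited edge list is re-encoded into one-hot edge features
# and fed to the trained network; no weights are changed.
edited = remove_edge(edges, 'player', 'scared_ghost')
edge_feats = torch.stack([rel_onehot[r] for _, r, _ in edited])
print(len(edited), edge_feats.shape)   # 3 torch.Size([3, 4])
```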

Figure 8: Manipulating Trained Agents in Sokoban. We used agents trained on the base knowledge graph and manipulated their behavior at runtime by changing the input knowledge graph.

6 Discussion

We have demonstrated the efficacy of a general approach to augmenting networks with knowledge graphs, which facilitates faster learning and, more critically, enables algorithms to apply their learned relations to novel objects. This was substantiated with experiments across two environments and multiple algorithms. Ablation studies highlight the importance of bi-directional information exchange between the state features and the knowledge graph, in contrast to previous work with one-directional feature concatenation [12, 13]. Moreover, the use of edge-conditioned convolution allows the agent to ground and leverage edge relations, as well as generalize to changing knowledge graphs. Previous approaches [12] are most similar to our 'Same Edges' case (Figure 6), which performed significantly worse.

Our approach is complementary to other approaches in RL that strive to improve sample efficiency and generalization, such as hierarchical RL [31], metalearning [5], or better exploration policies [32], and can be combined with them to build better overall systems. Interestingly, attempts to learn the knowledge graph during training were not successful (see 'Fully Connected' in Figure 6), and we speculate that graph attention models [33] could help prune the graph to only the useful relations. We used simple one-hot edge features throughout, whereas one could use word embeddings [34, 35] to seed the knowledge graph with semantic information.

The field has long debated the importance of reasoning with symbols and its compatibility with gradient based learning. Our architecture provides one framework to bridge these seemingly disparate approaches [36].

References

  • [1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [2] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
  • [3] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, Nov 2016.
  • [4] Rachit Dubey, Pulkit Agrawal, Deepak Pathak, Thomas L. Griffiths, and Alexei A. Efros. Investigating human priors for playing video games. arXiv:1802.10217, 2018.
  • [5] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. ICML, abs/1703.03400, 2017.
  • [6] Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, and John Schulman. Gotta learn fast: A new benchmark for generalization in rl. arXiv:1804.03720, 2018.
  • [7] Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krahenbuhl, Vladlen Koltun, and Dawn Song. Assessing generalization in deep reinforcement learning. arXiv:1810.12282, 2018.
  • [8] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. CoRR, abs/1606.04671, 2016.
  • [9] Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. NerveNet: Learning structured policy with graph neural networks. In International Conference on Learning Representations, 2018.
  • [10] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter Battaglia. Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, 2019.
  • [11] Michael Janner, Sergey Levine, William T. Freeman, Joshua B. Tenenbaum, Chelsea Finn, and Jiajun Wu. Reasoning about physical interactions with object-centric models. In International Conference on Learning Representations, 2019.
  • [12] Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, and Roozbeh Mottaghi. Visual semantic navigation using scene priors. ICLR, 2019.
  • [13] Prithviraj Ammanabrolu and Mark O. Riedl. Playing text-adventure games with graph-based deep reinforcement learning. NAACL19, abs/1812.01628, 2018.
  • [14] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. CoRR, abs/1704.02901, 2017.
  • [15] Marcel Nassar. Hierarchical bipartite graph convolution networks. CoRR, abs/1812.03813, 2018.
  • [16] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay, 2015.
  • [17] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning, 2016.
  • [18] Christopher P. Burgess, Loïc Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matthew Botvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation. CoRR, abs/1901.11390, 2019.
  • [19] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS, 2016.
  • [20] Brian Cheung, Jesse A. Livezey, Arjun K. Bansal, and Bruno A. Olshausen. Discovering hidden factors of variation in deep networks. arXiv:1412.6583, 2014.
  • [21] Klaus Greff, Raphaël Lopez Kaufmann, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loïc Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. ICML, abs/1903.00450, 2019.
  • [22] Irina Higgins, Loïc Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, and Alexander Lerchner. Early visual concept learning with unsupervised deep learning. CoRR, abs/1606.05579, 2016.
  • [23] Irina Higgins, Arka Pal, Andrei A. Rusu, Loïc Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In ICML, 2017.
  • [24] Catalin Ionescu, Tejas Kulkarni, Aäron van den Oord, Andriy Mnih, and Vlad Mnih. Learning to control visual abstractions for structured exploration in deep reinforcement learning. Deep Reinforcement Learning workshop, NeurIPS 2018, 2018.
  • [25] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
  • [26] Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv:1811.12359, 2018.
  • [27] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 2015.
  • [28] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.
  • [29] UC Berkeley. Pacman environment. 2019.
  • [30] Thomas Stepleton. The pycolab game engine, 2017.
  • [31] Tejas D. Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Joshua B. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. CoRR, abs/1604.06057, 2016.
  • [32] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Montezuma’s Revenge solved by Go-Explore, a new algorithm for hard-exploration problems (sets records on Pitfall, too), 2018.
  • [33] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
  • [34] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
  • [35] Christopher De Sa, Albert Gu, Christopher Ré, and Frederic Sala. Representation tradeoffs for hyperbolic embeddings, 2018.
  • [36] Marta Garnelo and Murray Shanahan. Reconciling deep learning with symbolic artificial intelligence: representing objects and relations. Current Opinion in Behavioral Sciences, 29:17–23, Oct 2019.
  • [37] Itai Caspi, Gal Leibovich, Gal Novik, and Shadi Endrawis. Reinforcement Learning Coach, 2017.
  • [38] Max Lapan. Pytorch agentnet, 2018.
  • [39] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
  • [40] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. 2016.
  • [41] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference for Learning Representations, 2015.
  • [42] Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 344–354. Association for Computational Linguistics, 2015.
  • [43] Michael Beetz, Moritz Tenorth, and Jan Winkler. Open-EASE – a knowledge processing service for robots and robotics/AI researchers. In IEEE International Conference on Robotics and Automation (ICRA), Seattle, Washington, USA, 2015. Finalist for the Best Cognitive Robotics Paper Award.
  • [44] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In In SIGMOD Conference, pages 1247–1250, 2008.
  • [45] Doug Lenat, Mayank Prakash, and Mary Shepherd. CYC: Using common sense knowledge to overcome brittleness and knowledge acquistion bottlenecks. AI Mag., 6(4):65–85, January 1986.
  • [46] H. Liu and P. Singh. ConceptNet - a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, October 2004.
  • [47] Ashutosh Saxena, Ashesh Jain, Ozan Sener, Aditya Jami, Dipendra K. Misra, and Hema S. Koppula. RoboBrain: Large-scale knowledge engine for robots. arXiv:1412.0691, 2014.
  • [48] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 697–706, New York, NY, USA, 2007. ACM.

Appendix A Supplemental Methods

A.1 Sokoban

The environment consists of a grid in which the agent is rewarded for pushing balls into their matching buckets. Lower-case alphanumeric characters denote balls, and upper-case characters denote buckets. The agent is identified by the A symbol, and the walls by +. For each variation, we generated 100 training mazes and 20 testing mazes, randomly varying the locations of the agent, ball(s), and bucket(s) in each maze. The agent received a positive reward for a successful pairing and a small penalty for each time step taken.
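A minimal sketch of this style of programmatic level generation is shown below; the grid size and symbol choices are placeholders, and solvability constraints (e.g., balls starting against a wall) are ignored.

```python
import random

def make_level(width=8, height=8, balls='a', buckets='B', seed=None):
    """Generate one ASCII level: '+' walls on the border, 'A' for the agent,
    lower-case letters for balls, upper-case letters for buckets. Solvability
    (e.g. balls not starting against a wall) is not checked in this sketch."""
    rng = random.Random(seed)
    grid = [['+' if x in (0, width - 1) or y in (0, height - 1) else ' '
             for x in range(width)]
            for y in range(height)]
    cells = [(x, y) for y in range(1, height - 1) for x in range(1, width - 1)]
    rng.shuffle(cells)
    for symbol in ['A'] + list(balls) + list(buckets):
        x, y = cells.pop()
        grid[y][x] = symbol
    return [''.join(row) for row in grid]

# 100 training mazes with one ball-bucket pair, 20 test mazes with unseen symbols.
train_levels = [make_level(seed=i) for i in range(100)]
test_levels = [make_level(balls='b', buckets='C', seed=1000 + i) for i in range(20)]
print(train_levels[0])
```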

A.2 Knowledge graph construction

The Sokoban games had edges similar to those shown in Figure 4 of the main text: an edge feature of '1' from the agent to all balls to encode a 'pushes' relationship; an edge feature of '2' between all rewarded ball-bucket pairs; and an edge feature of '0' between the agent and impassable objects, namely the bucket(s) and the wall symbol.

In Pacman, we add an ’impassable’ relation from all the agents (player, ghost, and scared ghost) to the wall. We also add distinct edges from the player to all edible entities and agents (coin, capsule, scared ghost, ghost).

A.3 Baseline Algorithms

We used DQN [1], Prioritized Experience Replay (PER) [16], and A2C [17] as the baseline RL algorithms. To keep the comparison fair, each baseline also received symbolic input. In the Sokoban experiments, we used a convolutional network equivalent to the PKGNet architecture with the connections from the knowledge graph removed. We performed an architecture search and did not find a model that outperformed it.

The best agent in Pacman had a deeper and wider convolutional network with four layers, followed by a multilayer perceptron.

In both models, after the convolutional layers, we computed a per-channel mean over the 2D map and passed the resulting vector into the multilayer perceptron (MLP).
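A minimal sketch of this head is shown below; the channel, hidden, and action counts are placeholders.

```python
import torch
import torch.nn as nn

class QHead(nn.Module):
    """Global per-channel average over the 2-D map, followed by an MLP that
    outputs one Q-value per action."""

    def __init__(self, channels=64, hidden=128, num_actions=5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_actions))

    def forward(self, feature_map):            # (B, C, H, W)
        pooled = feature_map.mean(dim=(2, 3))  # (B, C): per-channel spatial mean
        return self.mlp(pooled)                # (B, num_actions)

print(QHead()(torch.randn(2, 64, 7, 7)).shape)  # torch.Size([2, 5])
```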

We validated our implementation of the algorithms by comparing our performance on the Cartpole and Pong environments with those in Coach [37] and Ptan [38]. Software was implemented in PyTorch [39] and is attached with the manuscript (see Supplement). OpenAI Gym [40] and pycolab [30] were used to implement the environments.

A.4 Hyperparameters

We ran our experiments using the Adam optimizer [41]; the learning rate differed between the Sokoban and Pacman environments. We used a replay buffer size of 100,000 throughout; at every step, we sampled a batch of transitions from the buffer and trained the agent by minimizing the loss. In the Sokoban environments, we allowed the agent to run for 10,000 steps before commencing training.
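For reference, the sketch below shows how these settings might sit inside a DQN-style update step; the learning rate and batch size shown are placeholders, as the exact values are not reproduced here.

```python
import random
from collections import deque

# Buffer size and warm-up length come from the text above; the remaining
# values are placeholders, not the settings used in our experiments.
LEARNING_RATE = 1e-4        # placeholder
BATCH_SIZE = 32             # placeholder
REPLAY_CAPACITY = 100_000   # replay buffer size used throughout
WARMUP_STEPS = 10_000       # Sokoban: steps collected before training starts

replay_buffer = deque(maxlen=REPLAY_CAPACITY)  # holds (s, a, r, s', done) tuples

def train_step(model, optimizer, loss_fn, env_steps):
    """One update: after the warm-up period, sample a batch of transitions
    from the replay buffer and minimize the loss."""
    if env_steps < WARMUP_STEPS or len(replay_buffer) < BATCH_SIZE:
        return None
    batch = random.sample(replay_buffer, BATCH_SIZE)
    loss = loss_fn(model, batch)        # e.g. the DQN temporal-difference loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The optimizer would be constructed as, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```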

Appendix B Supplemental Figures

We provide several additional figures that were not included in the main paper:

  • Figure 9: Graphical depictions of the three contributed methods (Broadcast, Pooling, and KG-Conv) for transferring information between the knowledge graph and the scene representation.

  • Figure 10: In addition to Pacman, we also ran experiments with Sokoban where we took an agent trained on the knowledge graph, and observed its behavior when the input knowledge graph was altered. We were able to manipulate the agent behavior, and confirm that the learned edge semantics match the game structure and can be applied to novel objects. Just by changing the knowledge graph at test time, the agent can be manipulated to push buckets into balls, or push balls into other balls.

  • Figure 11: A more complete version of Figure 6 from the paper, with additional environments (one-one, two-one, and five-two) trained with different knowledge graph variants.

Figure 9: Layer types in our PKGNet architecture, including methods to Broadcast from the knowledge graph G to the state representation S, Pooling from S into G, and KG-Conv, which updates the state by jointly convolving over S and G.

Appendix C Model components

In this section, we provide more details on the model components, as shown in Figure 9, and described in the main text. We duplicate some of the text from the main paper here for readability.

Broadcast.

We define the function Broadcast(G) → S. For each entity v in the knowledge graph, we copy its graph representation h_v to each occurrence of v in the game map. This is used to initialize the state representation S, so that a common embedding refers to entities in both S and G. Formally, each location (x, y) in the state is computed as

S_{x,y} = \sum_{v \in V} m_v(x,y) \, h_v    (3)

where m_v(x,y) = 1 if the entity corresponding to node v is present at location (x, y) and zero otherwise. Thus, symbols in the game map not present in the knowledge graph are initialized with a zero vector.

Pooling.

The reverse of Broadcast, this operation is used to update the entity representations in the knowledge graph. In Pooling(S) → G, we update each node's representation by averaging the features in S over all instances of the corresponding entity in the state:

h_v = \frac{1}{n_v} \sum_{(x,y)\,:\,m_v(x,y)=1} W \, S_{x,y}    (4)

where n_v is the number of instances of v in the state. Since S and G may have different numbers of features, we use the weight matrix W to project from the state vectors to the dimensionality of the vertex features in the graph.

KG-Conv.

To update the state representation S, we augment a regular convolution layer with the knowledge graph. In addition to applying convolutional filters to the neighborhood of a location, we also add the node representation of the entity at that location, passed through a linear layer W to match the number of filters in the convolution. Formally, we can describe this operation as:

S'_{x,y} = \mathrm{Conv}(S)_{x,y} + W \, h_{v(x,y)}    (5)

where v(x,y) denotes the knowledge-graph node of the entity at location (x, y).

Appendix D Extended Future Directions

D.1 Scenes

The use of scene graphs could provide a framework to handle partial observability by building out portions of the environment as they are explored and storing them in the scene graph. As models that can extract objects from frames improve [19, 20, 22, 24, 25, 26], connecting the outputs of these models as inputs to the models developed here could provide a mechanism to go directly from pixels to actions.

D.2 Interpretability

The knowledge graph provides an interpretable way to instruct the Deep RL system in the rules of the game. While not explored here, these rules could include a model of the environment, facilitating the use of PKG-DQN in model-based RL. Future work could explore whether the structure of the knowledge graph, combined with the interpretability of the nodes and edges, could serve as a mechanism to overcome catastrophic forgetting. For example, new entities and relationships could be incrementally added to the knowledge graph, encoded in a way that is compatible with, and minimally disruptive to, the existing entities and relationships. A limitation is that even though the knowledge graph itself is interpretable, once messages from the knowledge graph are combined with messages in the scene graph, we sacrifice interpretability in favor of the learning power of gradient-based deep learning.

D.3 Knowledge graph

While we hand-code the knowledge graph in this study, future work could learn the knowledge graph directly from a set of environments, via information extraction approaches on text corpora, or by learning graph attention models over existing large knowledge graphs [42, 43, 44, 45, 46, 47, 48]. Knowledge graphs could also be generalized beyond the triplet structure to incorporate prior or instructional information in the form of computational graphs.

D.4 Environments

While we limited our analysis here to relatively small environments to test the fundamental aspects of our approach, scaling to larger environments is another obvious direction. Environments such as OpenAI Retro [6] or CoinRun [28] have helped spark an interest in the problem of generalization in Deep RL. However, the lack of readily available ground truth and the inability to programmatically generate levels hinder a rigorous development of algorithmic approaches to this problem using Retro. We believe that further development of benchmarks for generalization in Deep RL [7] that enable programmatic game creation and make ground truth accessible will help the field.

Figure 10: Manipulating agent behavior. We use an already-trained agent and manipulate its behavior at test time by modifying the input knowledge graph. For each manipulation, we show the resulting knowledge graph, the game state, and the resulting agent behavior. These studies show that the agent learned the semantic meaning of the edges ('push', 'target') that we intended and can apply those learned relations to different objects. For example, the trained agent can be manipulated to push buckets into balls, or balls into other balls, without any additional training.

Figure 11: Model performance of PKG-DQN when trained on various knowledge graph types, in the one-one, two-one, and five-two environments. Tested types are described in the paper.