Natural Language State Representation for Reinforcement Learning
Recent advances in Reinforcement Learning have highlighted the difficulties in learning within complex, high-dimensional domains. We argue that one of the main reasons current approaches do not perform well is that the information is represented sub-optimally. A natural way to describe what we observe is through natural language. In this paper, we implement a natural language state representation to learn and complete tasks. Our experiments suggest that natural language based agents are more robust, converge faster, and perform better than vision based agents, showing the benefit of using natural language representations for Reinforcement Learning.
“The world of our experiences must be enormously simplified and generalized before it is possible to make a symbolic inventory of all our experiences of things and relations.” — Edward Sapir, Language: An Introduction to the Study of Speech, 1921
Deep Learning based algorithms use neural networks in order to learn feature representations that are good for solving high dimensional Machine Learning (ML) tasks. Reinforcement Learning (RL) is a subfield of ML that has been greatly affected by the use of deep neural networks as universal function approximators [csaji2001approximation; lu2017expressive]. These deep neural networks are used in RL to estimate value functions, state-action value functions, policy mappings, next-state predictions, rewards, and more [mnih2015human; schulman2017proximal; hafner2018learning], thus combating the “curse of dimensionality”.
The term representation is used differently in different contexts. For the purpose of this paper we define a semantic representation of a state as one that reflects its meaning as it is understood by an expert. The semantic representation of a state should thus be paired with a reliable and computationally efficient method for extracting information from it. Previous success in RL has mainly focused on representing the state in its raw form (e.g., visual input in Atari-based games [mnih2015human]). This approach stems from the belief that neural networks (specifically convolutional networks) can extract meaningful features from complex inputs. In this work, we challenge current representation techniques and suggest to represent the state using natural language, similar to the way we, as humans, summarize and transfer information efficiently from one to the other [sapir2004language].
The ability to associate states with natural language sentences that describe them is a hallmark of understanding representations for reinforcement learning. Humans use rich natural language to describe and communicate their visual perceptions, feelings, beliefs, strategies, and more. The semantics inherent to natural language carry knowledge and cues of complex types of content, including: events, spatial relations, temporal relations, semantic roles, logical structures, support for inference and entailment, as well as predicates and arguments [abend2017state]. The expressive nature of language can thus act as an alternative semantic state representation.
Over the past few years, Natural Language Processing (NLP) has shown an acceleration in progress on a wide range of downstream applications ranging from Question Answering [kumar2016ask; liu2018stochastic], to Natural Language Inference [parikh2016decomposable; chen2017enhanced; chen2018neural] through Syntactic Parsing [williams2018latent; shi2018tree; shen2019ordered]. Recent work has shown the ability to learn flexible, hierarchical, contextualized representations, obtaining state-of-the-art results on various natural language processing tasks [devlin2019bert]. A basic observation of our work is that natural language representations are also beneficial for solving problems in which natural language is not the underlying source of input. Moreover, our results indicate that natural language is a strong alternative to current complementary methods for semantic representations of a state.
In this work we assume a state can be described using natural language sentences. We use distributional embedding methods (distributional methods build on the hypothesis that words which occur in a similar context tend to have similar meaning [firth1957synopsis], i.e., the meaning of a word can be inferred from the distribution of words around it) in order to represent sentences, processed with a standard Convolutional Neural Network for feature extraction. In Section 2 we describe the basic frameworks we rely on. We discuss possible semantic representations in Section 3, namely, raw visual inputs, semantic segmentation, feature vectors, and natural language representations. Then, in Section 4 we compare NLP representations with their alternatives. Our results suggest that representation of the state using natural language can achieve better performance, even on difficult tasks, or tasks in which the description of the state is saturated with task-nuisances [achille2018emergence]. Moreover, we observe that NLP representations are more robust to transfer and changes in the environment. We conclude the paper with a short discussion and related work.
2.1 Reinforcement Learning
In Reinforcement Learning the goal is to learn a policy $\pi(a \mid s)$, which is a mapping from state $s$ to a probability distribution over actions $a$, with the objective of maximizing a reward $r$ provided by the environment. This is often solved by formulating the problem as a Markov Decision Process (MDP) [sutton1998reinforcement]. Two common quantities used to estimate the performance in MDPs are the value and action-value functions, which are defined as follows: $V^\pi(s) = \mathbb{E}^\pi [\sum_t \gamma^t r_t \mid s_0 = s]$ and $Q^\pi(s, a) = \mathbb{E}^\pi [\sum_t \gamma^t r_t \mid s_0 = s, a_0 = a]$. Two prominent algorithms for solving RL tasks, which we use in this paper, are the value-based DQN [mnih2015human] and the policy-based PPO [schulman2017proximal].
Deep Q Networks (DQN): The DQN algorithm is an extension of the classical Q-learning approach to a deep learning regime. Q-learning learns the optimal policy by directly learning the value function, i.e., the action-value function. A neural network is used to estimate the $Q$-values and is trained to minimize the Bellman error, namely $L(\theta) = \mathbb{E}_{s, a, r, s'} \left[ \left( r + \gamma \max_{a'} Q_{\theta_{\text{target}}}(s', a') - Q_\theta(s, a) \right)^2 \right]$.
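To make the Bellman error concrete, the following minimal sketch computes the one-step TD error for a single transition; the dictionary-based Q-tables and example values are illustrative assumptions, not our implementation (which uses neural network function approximation).

```python
def dqn_td_error(q, q_target, transition, gamma=0.99):
    """One-step TD error inside the Bellman loss: target minus current estimate.
    q and q_target are hypothetical dict-based Q-tables for illustration."""
    s, a, r, s_next, done = transition
    target = r if done else r + gamma * max(q_target[s_next].values())
    return target - q[s][a]

# Tiny example with two states and two actions.
q = {"s0": {"left": 0.0, "right": 1.0}, "s1": {"left": 2.0, "right": 0.5}}
q_tgt = {"s0": {"left": 0.0, "right": 1.0}, "s1": {"left": 2.0, "right": 0.5}}
# target = 1.0 + 0.99 * max(2.0, 0.5) = 2.98; error = 2.98 - 1.0 = 1.98
err = dqn_td_error(q, q_tgt, ("s0", "right", 1.0, "s1", False))
```

In the full DQN, this error is squared, averaged over a replay batch, and minimized by gradient descent while the target network's parameters are periodically synchronized.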
Proximal Policy Optimization (PPO): While the DQN learns the optimal behavioral policy using a dynamic programming approach, PPO takes a different route. PPO builds upon the policy gradient theorem, which optimizes the policy directly, with an addition of a trust-region update rule. The policy gradient theorem updates the policy by $\nabla_\theta J(\theta) = \mathbb{E}^\pi \left[ \nabla_\theta \log \pi_\theta (a \mid s) Q^\pi(s, a) \right]$.
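As an illustration of the trust-region update rule, the sketch below evaluates PPO's clipped surrogate objective for a single sample; the clipping parameter and the inputs are hypothetical values, not those used in our experiments.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's trust-region surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r is the new/old policy probability ratio and A the advantage."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# Large policy ratios are clipped, so a single update cannot move far
# from the old policy in either direction.
assert ppo_clipped_objective(2.0, 1.0) == 1.2    # positive advantage: capped at 1+eps
assert ppo_clipped_objective(0.5, -1.0) == -0.8  # negative advantage: capped at 1-eps
```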
2.2 Deep Learning for NLP
A word embedding is a mapping from a word $w$ to a vector $\mathbf{w} \in \mathbb{R}^d$. A simple form of word embedding is the Bag of Words (BoW), a vector $\mathbf{w} \in \mathbb{N}^{|D|}$ ($|D|$ is the dictionary size), in which each word receives a unique 1-hot vector representation. Recently, more efficient methods have been proposed, in which the embedding vector is smaller than the dictionary size, $d \ll |D|$. These methods are also known as distributional embeddings.
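A BoW representation can be sketched as follows; the toy vocabulary is an assumption chosen for illustration only.

```python
def bag_of_words(sentence, vocab):
    """Sum of 1-hot vectors over a fixed dictionary: a |D|-dimensional
    count vector, one slot per vocabulary word."""
    vec = [0] * len(vocab)
    for word in sentence.split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

# Hypothetical 4-word dictionary in the spirit of the Doom domain.
vocab = {"enemy": 0, "close": 1, "far": 2, "medipack": 3}
assert bag_of_words("enemy close enemy", vocab) == [2, 1, 0, 0]
```

Distributional embeddings replace each of these sparse $|D|$-dimensional slots with a learned dense $d$-dimensional vector.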
The distributional hypothesis in linguistics is derived from the semantic theory of language usage (i.e. words that are used and occur in the same contexts tend to have similar meanings). Distributional word representations are a fundamental building block for representing natural language sentences. Word embeddings such as Word2vec [mikolov2013efficient] and GloVe [pennington2014glove] build upon the distributional hypothesis, improving efficiency of state-of-the-art language models.
Convolutional Neural Networks (CNNs), originally invented for computer vision, have been shown to achieve strong performance on text classification tasks [johnson2015effective; bai2018empirical], as well as other traditional NLP tasks [collobert2011natural]. In this paper we consider a common architecture [kim2014convolutional], in which each word in a sentence is represented as an embedding vector, and a single convolutional layer with $m$ filters is applied, producing an $m$-dimensional vector for each $n$-gram. The vectors are combined using max-pooling followed by a ReLU activation. The result is then passed through multiple hidden linear layers with ReLU activation, eventually generating the final output.
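The convolve-then-max-pool step of this architecture can be illustrated with a minimal, dependency-free sketch; a real implementation uses $m$ learned filters per width, while here a single hand-written bigram filter and 2-dimensional "embeddings" stand in.

```python
def conv1d_over_ngrams(embedded, filt):
    """Slide one filter of width n over the word embeddings,
    producing one scalar per n-gram position."""
    n = len(filt)
    outs = []
    for i in range(len(embedded) - n + 1):
        s = sum(filt[j][k] * embedded[i + j][k]
                for j in range(n) for k in range(len(embedded[0])))
        outs.append(s)
    return outs

def max_pool_relu(values):
    """Collapse all n-gram responses into a single feature."""
    return max(0.0, max(values))

# Three 2-dim word "embeddings" and one width-2 (bigram) filter.
sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filt = [[1.0, 0.0], [0.0, 1.0]]
feat = max_pool_relu(conv1d_over_ngrams(sent, filt))  # one pooled feature
```

With $m$ filters this yields an $m$-dimensional feature vector, which then feeds the fully connected layers.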
3 Semantic Representation Methods
Contemporary methods for semantic representation of states currently follow one of three approaches: (1) raw visual inputs [mnih2015human; kempka2016vizdoom], in which raw sensory values of pixels are used from one or multiple sources, (2) feature vectors [todorov2012mujoco; he2018amc], in which general features of the problem are chosen, with no specific structure, and (3) semantic segmentation maps [ronneberger2015u; he2017mask], in which discrete or logical values are used in one or many channels to represent the general features of the state.
The common approach is to derive decisions (e.g., classification, action, etc.) based on information in its raw form. In RL, the raw form is often the pixels representing an image – however the image is only one form of a semantic representation. In Semantic Segmentation, the image is converted from a 3-channel (RGB) matrix into an $N$-channel matrix, where $N$ is the number of classes. In this case, each channel represents a class, and a binary value at each coordinate denotes whether or not this class is present in the image at this location. For instance, Fig. 1 considers an autonomous vehicle task. The raw image and segmentation maps are both sufficient for the task (i.e., both contain a sufficient semantic representation). Nevertheless, the semantic segmentation maps contain fewer task-nuisances [achille2018emergence], which are random variables that affect the observed data, but are not informative to the task we are trying to solve.
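The conversion from a per-pixel class map to an $N$-channel binary segmentation tensor can be sketched as follows; the 2×2 label map is a toy example.

```python
def to_segmentation_channels(class_map, num_classes):
    """Build an N-channel binary map: channel c is 1 exactly where
    class c appears in the per-pixel class-id map."""
    h, w = len(class_map), len(class_map[0])
    return [[[1 if class_map[y][x] == c else 0 for x in range(w)]
             for y in range(h)] for c in range(num_classes)]

label_map = [[0, 1], [2, 1]]            # per-pixel class ids (toy 2x2 frame)
channels = to_segmentation_channels(label_map, 3)
assert channels[1] == [[0, 1], [0, 1]]  # binary mask for class 1
```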
In this paper we propose a fourth method for representing a state, namely using natural language descriptions. One method to achieve such a representation is through Image Captioning [tran2016rich; hossain2019comprehensive]. Natural language is both rich and flexible. This flexibility enables the algorithm designer to represent the information present in the state as efficiently and compactly as possible. As an example, the top image in Fig. 1 can be represented using natural language as “There is a car in your lane two meters in front of you, a bicycle rider on your far left in the negative lane, a car in your direction in the opposite lane which is twenty meters away, and trees and pedestrians on the sidewalk.” or compactly by “There is a car two meters in front of you, a pedestrian on the sidewalk to your right, and a car inbound in the negative lane which is far away.”. Language also allows us to efficiently compress information. As an example, the segmentation map in the bottom image of Fig. 1 can be compactly described by “There are 13 pedestrians crossing the road in front of you”. In the next section we will demonstrate the benefits of using natural-language-based semantic state representations in a first person shooter environment.
4 Semantic State Representations in the Doom Environment
In this section we compare the different types of semantic state representations in the ViZDoom environment [kempka2016vizdoom], as described in the previous section. More specifically, we use a semantic natural language parser in order to describe a state, over numerous instances of levels varying in difficulty, task-nuisances, and objectives. Our results show that, though semantic segmentation and feature vector representation techniques express similar statistics of the state, natural language representation offers better performance, faster convergence, more robust solutions, as well as better transfer.
The ViZDoom environment involves a 3D world that is significantly more real-world-like than Atari 2600 games, with a relatively realistic physics model. An agent in the ViZDoom environment must effectively perceive, interpret, and learn the 3D world in order to make tactical and strategic decisions of where to go and how to act. There are three types of state representations that are provided by the environment. The first, which is also most commonly used, is raw visual inputs, in which the state is represented by an image from a first person view of the agent. A feature vector representation is an additional state representation provided by the environment. The feature vector representation includes positions as well as labels of all objects and creatures in the vicinity of the agent. Lastly, the environment provides a semantic segmentation map based on the aforementioned feature vector. An example of the visual representations in VizDoom is shown in Fig. 2.
In order to incorporate natural language representation into the VizDoom environment, we constructed a semantic parser of the semantic segmentation maps provided by the environment. Each state of the environment was converted into a natural language sentence based on the positions and labels of objects in the frame. To implement this, the screen was divided into several vertical and horizontal patches, as depicted in Fig. 3. These patches describe relational aspects of the state, such as the distance of objects and their direction with respect to the agent's point of view. In each patch, objects were counted, and a natural language description of the patch was constructed. This technique was repeated for all patches to form the final state representation. Fig. 4 depicts examples of natural language sentences of different states in the environment.
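A minimal sketch of such a parser is given below; the patch names, screen width, object format, and wording are illustrative assumptions rather than our exact implementation (which also uses horizontal patches and distance thresholds).

```python
from collections import Counter

def describe_state(objects, num_patches=3, screen_width=640):
    """Hypothetical mini-parser: bucket labeled objects into vertical
    patches by x-coordinate, count them, and verbalize the counts."""
    names = ["left", "front", "right"]  # assumes exactly 3 patches
    patch_w = screen_width / num_patches
    buckets = [Counter() for _ in range(num_patches)]
    for label, x in objects:            # objects are (label, screen x) pairs
        buckets[min(int(x // patch_w), num_patches - 1)][label] += 1
    parts = []
    for name, counts in zip(names, buckets):
        for label, n in sorted(counts.items()):
            parts.append(f"{n} {label}{'s' if n > 1 else ''} to your {name}")
    return ", ".join(parts) if parts else "nothing in sight"

s = describe_state([("monster", 100), ("monster", 150), ("medipack", 400)])
assert s == "2 monsters to your left, 1 medipack to your front"
```

The resulting sentence is then embedded word by word, as described in Section 2.2.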
We tested the natural language representation against the visual-based and feature representations on several tasks, with varying difficulty. In these tasks, the agent could navigate, shoot, and collect items such as weapons and medipacks. Often, enemies of different types attacked the agent, and a positive reward was given when an enemy was killed. Occasionally, the agent also suffered from health degeneration. The tasks included a basic scenario, a health gathering scenario, a scenario in which the agent must take cover from fireballs, a scenario in which the agent must defend itself from charging enemies, and a super scenario, where a mixture of the above scenarios was designed to challenge the agent.
More specifically, in the basic scenario, a single monster is spawned in front of the agent. The purpose of this scenario is to teach the agent to aim at the enemy and shoot at it. In the health gathering scenario, the floor of the room is covered in toxin, causing the agent to gradually lose health. Medipacks are spawned randomly in the room and the agent's objective is to keep itself alive by collecting them. In the take cover scenario, multiple fireball shooting monsters are spawned in front of the agent. The goal of the agent is to stay alive as long as possible, dodging inbound fireballs. The difficulty of the task increases over time, as additional monsters are spawned. In the defend the center scenario, melee attacking monsters are randomly spawned in the room, and charge towards the agent. As opposed to other scenarios, the agent is incapable of moving, aside from turning left and right and shooting. In the defend the line scenario, both melee and fireball shooting monsters are spawned near the opposing wall. The agent can only step right, left or shoot. Finally, in the “super” scenario both melee and fireball shooting monsters are repeatedly spawned all over the room. The room contains various items the agent can pick up and use, such as medipacks, shotguns, ammunition and armor. Furthermore, the room is filled with unusable objects, various types of trees, pillars and other decorations. The agent can freely move and turn in any direction, as well as shoot. This scenario combines elements from all of the previous scenarios.
Our agent was implemented using a Convolutional Neural Network as described in Section 2.2. We converted the parsed state into embedded representations of fixed length. We tested both a DQN and a PPO based agent, and compared the natural language representation to the other representation techniques, namely the raw image, feature vector, and semantic segmentation representations.
In order to effectively compare the performance of the different representation methods, we conducted our experiments under similar conditions for all agents. The same hyper-parameters were used under all tested representations. Moreover, to rule out effects of architectural expressiveness, the number of weights in all neural networks was approximately matched, regardless of the input type. Finally, we ensured the “super” scenario was positively biased toward image-based representations. This was done by adding a large number of items to the game level, thereby filling the state with nuisances (these tests are denoted by ‘nuisance’ in the scenario name). This was especially evident in the NLP representations, as sentences became extensively longer (an average of over 250 words). This is contrary to image-based representations, which did not change in dimension.
Results of the DQN-based agent are presented in Fig. 5. Each plot depicts the average reward (across 5 seeds) of all representation methods. It can be seen that the NLP representation outperforms the other methods, despite the fact that it contains the same information as the semantic segmentation maps. More interestingly, comparing the vision-based and feature-based representations yields inconsistent conclusions with respect to their relative performance. NLP representations remain robust to changes in the environment as well as task-nuisances in the state. As depicted in Fig. 6, inflating the state space with task-nuisances impairs the performance of all representations. There, a large number of unnecessary objects were spawned in the level, increasing the state's description length to over 250 words, whilst retaining the same amount of useful information. Nevertheless, the NLP representation outperformed the vision and feature based representations, with high robustness to the applied noise.
In order to verify that the performance of the natural language representation was not due to extensive discretization of patches, we conducted experiments increasing the number of horizontal patches, ranging from 3 to 31 patches in the extreme case. Our results, as depicted in Fig. 7, indicate that the degree of patch discretization did not affect the performance of the NLP agent, which remained a superior representation compared to the rest.
To conclude, our experiments suggest that NLP representations, though they describe the same raw information as the semantic segmentation maps, are more robust to task-nuisances, allow for better transfer, and achieve higher performance in complex tasks, even when their description is long and convoluted. While we have only presented results for DQN agents, we include plots for a PPO agent in the Appendix, showing similar trends and conclusions. We thus deduce that NLP-based semantic state representations are a preferable choice for training VizDoom agents.
5 Related Work
Work on representation learning is concerned with finding an appropriate representation of data in order to perform a machine learning task [goodfellow2016deep]. In particular, deep learning exploits this concept by its very nature [mnih2015human]. Work on representation learning include Predictive State Representations (PSR) [littman2002predictive; jiang2016improving], which capture the state as a vector of predictions of future outcomes, and a Heuristic Embedding of Markov Processes (HEMP) [engel2001learning], which learns to embed transition probabilities using an energy-based optimization problem.
There has been extensive work attempting to use natural language in RL. Efforts that integrate language in RL develop tools, approaches, and insights that are valuable for improving the generalization and sample efficiency of learning agents. Previous work on language-conditioned RL has considered the use of natural language in the observation and action space. Environments such as Zork and TextWorld [cote2018textworld] have been the standard benchmarks for testing text-based games. Nevertheless, these environments do not search for semantic state representations, in which an RL algorithm can be better evaluated and controlled.
eisenstein2009reading use high-level semantic abstractions of documents in a representation to facilitate relational learning using Inductive Logic Programming and a generative language model. branavan2012learning use high-level guidance expressed in text to enrich a stochastic agent, playing against the built-in AI of Civilization II. They train an agent with the Monte-Carlo search framework in order to jointly learn to identify text that is relevant to a given game state as well as game strategies based only on environment feedback. narasimhan2018grounding utilize natural language in a model-based approach to describe the dynamics and rewards of an environment, showing these can facilitate transfer between different domains.
More recently, the structure and compositionality of natural language has been used for representing policies in hierarchical RL. In a paper by hu2019hierarchical, instructions given in natural language were used in order to break down complex problems into high-level plans and lower-level actions. Their suggested framework leverages the structure inherent to natural language, allowing for transfer to unfamiliar tasks and situations. This use of semantic structure has also been leveraged by tennenholtz2019natural, where abstract actions (not necessarily words) were recognized as symbols of a natural and expressive language, improving performance and transfer of RL agents.
Outside the context of RL, previous work has also shown that high-quality linguistic representations can assist in cross-modal transfer, such as using semantic relationships between labels for zero-shot transfer in image classification [socher2013zero; frome2013devise].
6 Discussion and Future Work
Our results indicate that natural language can outperform, and sometimes even replace, vision-based representations. Nevertheless, natural language representations can also have disadvantages in various scenarios. For one, they require the designer to be able to describe the state exactly, whether by a rule-based or learned parser. Second, they abstract away notions of the state space that the designer may not realize are necessary for solving the problem. As such, semantic representations should be carefully chosen, similar to the process of reward shaping or choosing a training algorithm. Here, we enumerate three instances in which we believe natural language representations are beneficial:
Natural use-case: Information contained in both generic and task-specific textual corpora may be highly valuable for decision making. This case assumes the state can either be easily described using natural language or is already in a natural language state. This includes examples such as user-based domains, in which user profiles and comments are part of the state, or the stock market, in which stocks are described by analysts and other readily available text. 3D physical environments such as VizDoom also fall into this category, as semantic segmentation maps can be easily described using natural language.
Subjective information: Subjectivity refers to aspects used to express opinions, evaluations, and speculations. These may include strategies for a game, the way a doctor feels about her patient, the mood of a driver, and more.
Unstructured information: In these cases, features might be measured by different units, with an arbitrary position in the state’s feature vector, rendering them sensitive to permutations. Such state representations are thus hard to process using neural networks. As an example, the medical domain may contain numerous features describing the vitals of a patient. These raw features, when observed by an expert, can be efficiently described using natural language. Moreover, they allow an expert to efficiently add subjective information.
An orthogonal line of research considers automating the process of image annotation. The noise added from the supervised or unsupervised process serves as a great challenge for natural language representation. We suspect the noise accumulated by this procedure would require additional information to be added to the state (e.g., past information). Nevertheless, as we have shown in this paper, such information can be compressed using natural language. In addition, while we have only considered spatial features of the state, information such as movement directions and transient features can be efficiently encoded as well.
Natural language representations help abstract information and interpret the state of an agent, improving its overall performance. Nevertheless, it is imperative to choose a representation that best fits the domain at hand. Designers of RL algorithms should consider searching for a semantic representation that fits their needs. While this work only takes a first step toward finding better semantic state representations, we believe the structure inherent in natural language can be considered a favorable candidate for achieving this goal.
7.1 The ViZDoom Environment
VizDoom is a “Doom”-based research environment that was developed at the Poznań University of Technology. It is based on the “ZDoom” game executable, and includes a Python-based API. The API offers the user the ability to run game instances, query the game state, and execute actions. The original purpose of VizDoom is to provide a research platform for vision-based reinforcement learning. Thus, a natural language representation for the game had to be implemented. ViZDoom emulates the “Doom” game and enables us to access data within a certain frame using Python dictionaries. This makes it possible to extract valuable data including player health, ammo, enemy locations, etc. Each game frame contains “labels”, which contain data on visible objects in the game (the player, enemies, medkits, etc.). We used “Doom Builder” in order to edit some of the scenarios and design a new one. Environment rewards are presented in Table 2.
7.2 Natural Language State Space
A semantic representation using natural language should contain information which can be deduced by a human playing the game. For example, even though a human does not know the exact distance between objects, they can classify them as “close” or “far”. However, objects that are outside the player's field of vision cannot be part of the state. Furthermore, a human would most likely refer to an object's location relative to themselves, using directions such as “right” or “left”.
7.3 Language Model Implementation
To convert each frame to a natural language representation state, the list of available labels is iterated, and a string is built accordingly. The main idea of our implementation is to divide the screen into multiple vertical patches, count the number of objects of each type inside each patch, and parse the result as a sentence. The decision as to whether an object is close or far can be determined by calculating the distance from it to the player, and using two threshold levels. Object descriptions can be concise or detailed, as needed. We experimented with the following mechanics:
- Patch Size: the screen can be divided between patches equally, or by predetermined ratios. Here, our main guideline was to keep the “front” patch narrow enough so it can be used as “sights”.
- Patch Count: our initial experiment was with 3 patches, and later we added 2 more patches classified as “outer left” and “outer right”. In our experiments we tested up to 51 patches, referred to as left or right patches with corresponding numbers.
- Distance Thresholds: we used 2 thresholds, which allowed us to classify the distance of an object from the player as “close”, “mid”, and “far”. Depending on the task, the values of the thresholds can be changed, and more thresholds can be added.
- Sentence Length: different states might generate sentences of different lengths, so a maximum sentence length is another parameter that was tested. Table 1 presents the average word count in some of the game scenarios.
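As a concrete example of the distance-threshold mechanic, the following sketch classifies a distance using two thresholds; the threshold values are placeholders, since they depend on the task.

```python
def distance_label(dist, thresholds=(200.0, 500.0)):
    """Two thresholds split distances into 'close', 'mid', and 'far'.
    The threshold values here are illustrative assumptions."""
    close_t, far_t = thresholds
    if dist < close_t:
        return "close"
    return "mid" if dist < far_t else "far"

assert [distance_label(d) for d in (50, 300, 900)] == ["close", "mid", "far"]
```

Adding more thresholds simply refines this partition into finer-grained distance words.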
After the sentence describing the state is generated, it is transformed to an embedding vector. Words that were not found in the vocabulary were replaced with an “OOV” vector. All words were then concatenated to an N×D×1 matrix, representing the state. We experimented with both Word2Vec and GloVe pretrained embedding vectors. Eventually, we used the latter, as it consumes less memory and speeds up the training process. The length of the state sentence is one of the hyperparameters of the agents; shorter sentences are zero padded, while longer ones are trimmed.
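The embedding lookup with OOV replacement and zero padding can be sketched as follows; the 2-dimensional toy embeddings and the OOV vector values are assumptions for illustration (our agents use pretrained GloVe vectors).

```python
def sentence_to_matrix(sentence, embeddings, max_len, dim):
    """Look up each word's embedding; unknown words get an OOV vector;
    the result is padded or trimmed to exactly max_len rows."""
    oov = [0.5] * dim                      # placeholder OOV vector (assumption)
    pad = [0.0] * dim
    rows = [embeddings.get(w, oov) for w in sentence.split()[:max_len]]
    rows += [pad] * (max_len - len(rows))  # zero-pad shorter sentences
    return rows

emb = {"enemy": [1.0, 0.0], "close": [0.0, 1.0]}
m = sentence_to_matrix("enemy close zorgon", emb, max_len=4, dim=2)
assert m == [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
```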
7.4 Model Implementation
All of our models were implemented using PyTorch. The DQN agents used a single network that outputs the Q-values of the available actions. The PPO agents used an Actor-Critic model with two networks: the first outputs the policy distribution for the input state, and the second network outputs its value. As mentioned earlier, we used three common neural network architectures:
- Convolutional Neural Network
Used for the raw image and semantic segmentation based agents. VizDoom's raw output is a 640×480×3 RGB image. We experimented with both the original image and its down-sampled version. The semantic segmentation image was of resolution 640×480×1, where the pixel value represents the object's class, generated using the VizDoom label API. The network consists of two convolutional layers, two hidden linear layers, and an output layer. The first convolutional layer has 8 filters of size 6×6 with stride 3 and ReLU activation. The second convolutional layer has 16 filters of size 3×3 with stride 2 and ReLU activation. The fully connected layers have 32 and 16 units, both followed by ReLU activation. The output layer's size is the number of actions available to the agent in the trained scenario.
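Assuming valid (unpadded) convolutions, the spatial dimensions after each layer of this stack can be checked with the standard output-size formula; the no-padding assumption is ours, as the padding scheme is not stated above.

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a valid (no padding) convolution:
    floor((size - kernel) / stride) + 1."""
    return (size - kernel) // stride + 1

# Shapes for the described stack on the full 640x480 input.
w1, h1 = conv_out(640, 6, 3), conv_out(480, 6, 3)  # after 8 6x6 filters, stride 3
w2, h2 = conv_out(w1, 3, 2), conv_out(h1, 3, 2)    # after 16 3x3 filters, stride 2
assert (w1, h1) == (212, 159)
assert (w2, h2) == (105, 79)
```

The flattened 16 × 105 × 79 feature map then feeds the 32- and 16-unit fully connected layers.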
- Multilayer Perceptron
Used in the feature vector based agent. Naturally, some discretization is needed in order to build a feature vector, so some of the state data is lost. The feature vector was made up of features we extracted from the VizDoom API, and its dimension was 90×1. The network consists of two fully connected layers, each followed by a ReLU activation. The first layer has 32 units, and the second has 16 units. The output layer's size is the number of actions available to the agent.
- TextCNN
Used in the natural language based agent. As previously mentioned, each natural language state is transformed into a 200×50×1 matrix (up to 200 words, each embedded as a 50-dimensional vector). The first layers of the TextCNN are convolutional layers with 8 filters each, which are designed to scan the input sentence and return convolution outputs of sequences of varying lengths. The filters vary in width, such that each of them learns to identify different lengths of word sequences. Longer filters have a higher capability of extracting features from longer word sequences. The filters we chose have the following dimensions: 3×50×1, 4×50×1, 5×50×1, 8×50×1, 11×50×1. Following the convolutional layer there is a ReLU activation and a max pool layer. Finally, there are two fully connected layers; the first layer has 32 units, and the second has 16 units. Both are followed by ReLU activation.
All architectures have the same output, regardless of the input type. The DQN network is a regression network, with its output size equal to the number of available actions. The PPO agent has two networks: actor and critic. The actor network has a Softmax output with size equal to the number of available actions. The critic network is a regression model with a single output representing the state's value. Reward plots for the PPO agent can be found in Figure 8.
| Patch Count | Scenario | Average Word Count |
|---|---|---|
| 3 | basic light nuisance | |
| 5 | basic light nuisance | |
| 11 | basic light nuisance | |
| 21 | basic light nuisance | |
| 3 | basic heavy nuisance | |
| 5 | basic heavy nuisance | |
| 11 | basic heavy nuisance | |
| 21 | basic heavy nuisance | |
| 3 | defend the center | |
| 5 | defend the center | |
| 11 | defend the center | |
| 21 | defend the center | |
| 3 | defend the center light nuisance | |
| 5 | defend the center light nuisance | |
| 11 | defend the center light nuisance | |
| 21 | defend the center light nuisance | |
| 3 | defend the center heavy nuisance | |
| 5 | defend the center heavy nuisance | |
| 11 | defend the center heavy nuisance | |
| 21 | defend the center heavy nuisance | |
| Scenario | Living Reward | Kill Reward | Description |
|---|---|---|---|
| Basic | -1 | 100 | aim and shoot at a single target |
| Health Gathering | 1 | 0 | collect health packs |
| Take Cover | 1 | 0 | dodge incoming missiles |
| Defend the Center | 0 | 1 | rotate and shoot incoming enemies |
| Defend the Line | 0 | 1 | shoot enemies while dodging missiles |
| Super Scenario | 0 | 1 | mixture of all the above |