Exploration for Multi-task Reinforcement Learning with Deep Generative Models

Sai Praveen B
Department of Computer Science & Engineering
Indian Institute of Technology Madras
JS Suhas
Department of Computer Science & Engineering
Indian Institute of Technology Madras
Balaraman Ravindran
Department of Computer Science & Engineering
Indian Institute of Technology Madras

Exploration in multi-task reinforcement learning is critical in training agents to deduce the underlying MDP. Many existing exploration frameworks, such as Thompson sampling, assume a single stationary MDP and are not suitable for system identification in the multi-task setting. We present a novel method to facilitate exploration in multi-task reinforcement learning using deep generative models. We supplement our method with a low-dimensional energy model to learn the underlying MDP distribution and provide a resilient and adaptive exploration signal to the agent. We evaluate our method on a new set of environments and provide an intuitive interpretation of our results.



1 Introduction

Multi-task reinforcement learning (MTRL) is the problem of learning to solve multiple tasks simultaneously. MTRL can be solved either by planning after deducing the current MDP, or by ignoring MDP deduction and learning a single policy over all the MDPs combined. For example, suppose a marker in the environment determines the obstacle structure and the goal locations. Any agent will have to visit the marker to learn the environment structure, and hence deducing the task to be solved becomes an important part of the agent's policy.
MTRL is different from transfer learning, which aims to generalize (or transfer) knowledge from a set of tasks to solve a new, but similar, task. MTRL also involves learning a common representation for a set of tasks, but makes no attempt to generalize to new tasks.
Conventional reinforcement learning algorithms such as Q-learning and SARSA fail to identify this decision-making sub-problem. They also learn sub-optimal policies because the reward structure is non-stationary: the goal location varies from episode to episode. To address these issues, current research in MTRL is driven towards using a model over all possible MDPs to deduce the current MDP, or using some form of memory to store past observations and then reasoning based on this history. These methods are able to deduce the MDP if they happen to see the markers, but make no effort to actively search for them.
Our methods incentivize the agent to actively seek markers and deduce the MDP by providing a smooth, adaptive exploration signal, as a pseudo-reward, obtained from a generative model using deep neural networks. Since our method relies on computing the Jacobian of some state representation with respect to the input, we use deep neural networks to learn this state representation. To the best of our knowledge, we are the first to propose using the Jacobian as an exploration bonus.
Our exploration bonus is added to the intrinsic reward, as in Bayesian Exploration Bonus. We focus on grid-worlds with colors for markers. For clarity, we redefine a state s to be a single grid-location, and treat its pixel value as the observation. Our methods can, however, generalize to arbitrarily complex distributions on the observations X. We also assume that the agent can deduce rewards and transition probabilities from the observation.

2 Related Work

There is extensive research on exploration strategies for reducing uncertainty in the MDP. Bayesian Exploration Bonus assigns a pseudo-reward to states, calculated using the frequency of state visitation. Thompson sampling samples an MDP from the posterior distribution computed using evidence (rewards and transition probabilities) that it obtains from trajectories. We follow a similar approach to sample MDPs. Recent advances in this domain involve sampling using Dropout Neural Networks [gal2015dropout]. However, these algorithms assume that a single stationary MDP is used for every episode (which we refer to as Single-Task RL, or STRL). Our algorithm addresses the multi-task RL problem, where each episode uses an MDP sampled from an arbitrary distribution on MDPs. Contrary to the STRL exploration strategies, our exploration bonus is designed to mark states that are potentially useful for the agent to improve its certainty about the current MDP.
Recent MTRL and transfer learning algorithms such as Value Iteration Networks [tamar2016value] and Actor-Mimic Networks [parisotto2015actor] attempt to identify common structure among tasks and generalize learning to new tasks with similar structure. In the context of MTRL and transfer learning on environments that give image-like observations, Value Iteration Networks employ Recurrent Neural Networks for value iteration and learn kernel functions to estimate the reward and transition probabilities for a state from its immediate surroundings. This has the effect of easily generalizing to new tasks which share the same MDP structure (the reward and transition probabilities for a state can be determined using locality assumptions). Our work does not attempt to learn common structure across MDPs for the purpose of transfer learning. Instead, we attempt to learn the input MDP distribution in order to deduce the current MDP given the observations.
[oh2016control] propose a novel method using Deep Recurrent Memory Networks to learn policies on Minecraft multi-task environments. They use a fixed memory of past observations. At each step, a context vector is generated and the memory network is queried for relevant information. This model successfully learns policies on I-shaped environments where the color of a marker cell determines the goal location. In their experiments with the I-world and Pattern-Recognition world, the identifier states are very close to the agent's starting position.
Another class of MTRL algorithms focuses on deducing the current MDP using Bayesian reasoning. Multi-class models proposed by [lazaric2010bayesian] and [wilson2007multi] attempt to assign class labels to the current MDP given a sequence of observations made from it. [wilson2007multi] use a Hierarchical Bayesian Model (HBM) to learn a conditional distribution over class labels given the observations. The agent samples an MDP from the posterior distribution in a manner similar to Thompson sampling, and then chooses the action. We follow the same procedure for action selection, but incorporate exploration bonuses into it as well.
[lazaric2010bayesian] propose Multi-class Multi-task Learning (MCMTL), a non-parametric Hierarchical Bayesian Model to learn inherent structure in the value functions of each class. MCMTL clusters MDPs into classes and learns a posterior distribution over the MDPs given observed evidence. This is similar to our work, but it does not explicitly incentivize the agent to visit marker states.

Our contributions are twofold. First, we propose a deep generative model that allows sampling from the posterior distribution over MDPs. Second, we propose a novel exploration bonus using the model's posterior distribution.

3 Background

3.1 Variational Auto Encoders

Variational Auto Encoders (VAE) [kingma2013auto] attempt to learn the distribution $P(X)$ that generated the data $X$. VAEs, like standard autoencoders, have an encoder component and a decoder component, linked through a latent code $z$. Generative models that attempt to estimate $P(X)$ use a likelihood objective function, $P(X)$ or $\log P(X)$. More formally, the objective function can be written as

$$P(X) = \int P(X \mid z)\, P(z)\, dz$$

where $P(X \mid z)$ is defined to be a Gaussian centred on the decoder output, $\mathcal{N}(f(z), \sigma^2 I)$.
Gradient-based learning requires approximating the integral with samples. In high-dimensional $z$-space, this could lead to large estimation errors, as $P(X \mid z)$ is likely to be concentrated around a few select values of $z$ and it would take an infeasible number of samples to get a proper estimate. VAEs circumvent this problem by introducing a new distribution $Q(z \mid X)$ to sample $z$ from. To reduce parameters, we use $Q(z \mid X) = \mathcal{N}(\mu(X), \Sigma(X))$. The two functions $\mu$ and $\Sigma$ are approximated with a deep network and form the encoder component of the VAE. The decoder component, which represents $P(X \mid z)$, is evaluated on samples $z = \mu(X) + \Sigma^{1/2}(X)\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. After some mathematical sleight of hand to account for $Q$ in the learning equations ([walker2016uncertain] provides an intuitive understanding of these equations), we obtain the following formulation of the loss function

$$\mathcal{L} = \lVert X - f(z) \rVert^2 + \mathrm{KL}\big(Q(z \mid X) \,\|\, P(z)\big)$$

where the KL-divergence term exists to adjust for importance sampling $z$ from $Q(z \mid X)$ instead of $P(z)$.
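
For reference, the KL-divergence term has a well-known closed form in the common special case of a diagonal Gaussian encoder and a standard normal prior. Both choices are assumptions here, as are the symbols $\mu_k$, $\sigma_k^2$ and the latent dimension $d$; the text does not state which prior or covariance structure was used.

```latex
\mathrm{KL}\Big(\mathcal{N}\big(\mu, \mathrm{diag}(\sigma^2)\big) \,\Big\|\, \mathcal{N}(0, I)\Big)
  = \frac{1}{2} \sum_{k=1}^{d} \left( \mu_k^2 + \sigma_k^2 - \log \sigma_k^2 - 1 \right)
```

The expression is zero exactly when $\mu = 0$ and $\sigma^2 = 1$ in every dimension, i.e. when the encoder output matches the prior.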

3.2 Gaussian-Binary Restricted Boltzmann Machines

RBMs have been used widely to learn energy models over an input distribution. An RBM is an undirected, complete bipartite, probabilistic graphical model with hidden units, $h$, and visible units, $v$. In Gaussian-Binary RBMs, the hidden units are binary (Bernoulli distributed) and capable of representing a total of $2^{|h|}$ combinations, while the visible units use the Gaussian distribution. The network is parametrized by an edge weight matrix $W$ between each node of $v$ and $h$, and bias vectors $b$ and $c$ for $v$ and $h$ respectively. Given a visible state $v$,
the hidden state, $h$, is obtained by sampling the posterior given by

$$P(h_j = 1 \mid v) = \mathrm{sigmoid}\Big(c_j + \sum_i W_{ij}\, \frac{v_i}{\sigma_i}\Big)$$

Given a hidden state $h$, the visible state $v$ is obtained by sampling the posterior given by

$$P(v_i \mid h) = \mathcal{N}\Big(b_i + \sigma_i \sum_j W_{ij}\, h_j,\; \sigma_i^2\Big)$$

Since RBMs model conditional distributions, the conditional distributions ($P(h \mid v)$ and $P(v \mid h)$) have a closed form, while the marginal and joint distributions ($P(v)$, $P(h)$ and $P(v, h)$) cannot be computed without explicit summation over all $2^{|h|}$ hidden combinations.

The parameters $W$, $b$ and $c$ are learnt using contrastive divergence [hinton2010practical]. Learning $\sigma$, however, proved to be unstable [hinton2010practical] and hence we treat $\sigma$ as a fixed hyperparameter.
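
A minimal NumPy sketch of a Gaussian-Binary RBM with a fixed visible standard deviation and a single contrastive-divergence (CD-1) update may help make the two conditionals concrete. It assumes the parametrization from Hinton's practical guide, in which visible values are scaled by `sigma`; the class name, the learning rate and the initialization scale are illustrative choices, not values taken from our experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class GaussianBinaryRBM:
    """Toy Gaussian-Binary RBM with a fixed, shared visible std-dev `sigma`."""

    def __init__(self, n_visible, n_hidden, sigma=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)  # visible bias
        self.c = np.zeros(n_hidden)   # hidden bias
        self.sigma = sigma            # hyperparameter, never updated

    def p_h_given_v(self, v):
        # Bernoulli probability that each hidden unit is on.
        return sigmoid(self.c + (v / self.sigma) @ self.W)

    def mean_v_given_h(self, h):
        # Mean of the Gaussian over visible units.
        return self.b + self.sigma * (h @ self.W.T)

    def sample_h(self, v):
        p = self.p_h_given_v(v)
        return (self.rng.random(p.shape) < p).astype(float)

    def sample_v(self, h):
        mean = self.mean_v_given_h(h)
        return mean + self.sigma * self.rng.standard_normal(mean.shape)

    def cd1_step(self, v0, lr=0.01):
        """One CD-1 update on a minibatch v0 of shape (batch, n_visible)."""
        ph0 = self.p_h_given_v(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
        v1 = self.sample_v(h0)          # one step of Gibbs sampling
        ph1 = self.p_h_given_v(v1)
        n = v0.shape[0]
        self.W += lr * ((v0 / self.sigma).T @ ph0 - (v1 / self.sigma).T @ ph1) / n
        self.b += lr * ((v0 - v1) / self.sigma ** 2).mean(axis=0)
        self.c += lr * (ph0 - ph1).mean(axis=0)
```

Keeping `sigma` out of the update mirrors the choice of treating the variance as a hyperparameter rather than a learnt quantity.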

4 Deep Generative Model

4.1 Encoding

Let us consider the nature of our inputs. We have assumed that the agent's observed surroundings are embedded on a map as an image $X$. A mask $M$ is a binary image of the same dimensions, with $M_{ij} = 1$ if its corresponding state has been observed by the agent. We denote the pixel and mask values at grid location $(i, j)$ by $X_{ij}$ and $M_{ij}$ respectively.

In most episodes, the agent will not visit the entire grid-world, hence $M_{ij} = 0$ for some $(i, j)$. Since there can be several views of the same ground-truth MDP, we need to be able to reconstruct the ground-truth MDP from multiple observations of the MDP over several episodes. For Single-Task RL, this can be done in a tabular fashion. In MTRL, however, we have potentially infinitely many possible MDPs and it becomes hard to build associations between different views of the same MDP.
We use deep convolutional VAEs to infer the association between different views of the same MDP and use it with a low-dimensional energy model to sample MDPs given observational evidence. Here, the locality of the world features warrants the use of convolution layers in the VAE. Figure 1 shows our setup to learn the associations and to infer the ground-truth MDP given observations. We use one setup to train the model and another to sample MDPs back from it. We call these the train and query models.
Our method can be scaled to large state spaces because of the VAE.

Figure 1: Deep Generative Model, with panels (a) Train Model and (b) Query Model. The train model requires mask inputs to account for missing observations. The query model involves value iteration to determine the best action over sampled MDPs.

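Given this setup, for the learning phase, we modify the VAE loss function to account for unobserved states, using a binary mask $M$ with $M_{ij} = 1$ for observed locations and $M_{ij} = 0$ otherwise. Writing $\hat{X}$ for the VAE reconstruction of the input image $X$, the new loss function is given by

$$\mathcal{L} = \sum_{i,j} M_{ij}\,\big(X_{ij} - \hat{X}_{ij}\big)^2 + \mathrm{KL}\big(Q(z \mid X) \,\|\, P(z)\big)$$

Inclusion of $M$ in the loss function is quite intuitive and works well on the sets that we tested on, since it removes any penalty for unseen $X_{ij}$ and allows the VAE to project its knowledge onto the unseen states.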
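
The effect of the mask can be illustrated with a short NumPy sketch. It assumes a squared-error reconstruction term, a diagonal Gaussian encoder described by `mu` and `log_var`, and a standard normal prior; these are illustrative assumptions, and the function name `masked_vae_loss` is hypothetical.

```python
import numpy as np

def masked_vae_loss(x, x_hat, mask, mu, log_var):
    """Masked reconstruction error plus KL(N(mu, exp(log_var)) || N(0, I))."""
    # Only observed pixels (mask == 1) contribute to the reconstruction term.
    recon = np.sum(mask * (x - x_hat) ** 2)
    # Closed-form KL for a diagonal Gaussian against a standard normal prior.
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)
    return recon + kl

# Usage: changing the reconstruction at an unobserved pixel leaves the loss unchanged.
x = np.ones((2, 2))
mask = np.array([[1.0, 0.0], [0.0, 0.0]])
mu, log_var = np.zeros(3), np.zeros(3)
loss_a = masked_vae_loss(x, np.zeros((2, 2)), mask, mu, log_var)
loss_b = masked_vae_loss(x, np.array([[0.0, 5.0], [0.0, 0.0]]), mask, mu, log_var)
```

Because unobserved pixels carry no penalty, the decoder is free to fill them in with whatever it has learnt from other views of the same world.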

4.2 Sampling

Given a partial observation, $X$, we sample from the posterior $Q(z \mid X)$ to obtain MDP samples. If $X$ doesn't have enough evidence to skew the posterior in favour of one single MDP, then the encoding produced by the VAE, $z$, is far from the encodings of ground-truth MDPs in $z$-space. If we sample from this posterior, we obtain an MDP that is a mixture of MDPs. Solving this MDP could result in the agent following a policy unsuitable for any of the component MDPs in isolation.
One way to circumvent this problem is to train a probability distribution over the MDP embeddings, $P(z)$. For our 2-MDP environments, we use a Gaussian-Binary RBM to cluster inputs with fixed-variance Gaussians. We then use Algorithm 1 to sample from these Gaussians.

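Algorithm 1: Sample MDPs given a partial observation $X$
Result: MDPs sampled from the model posterior
Sample hidden RBM states, $h$, from the posterior $P(h \mid z)$, where $z$ is the VAE encoding of $X$
Calculate the MAP estimate $\hat{z} = \arg\max_{z'} P(z' \mid h)$
Decode the MAP estimates to get MDP samples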
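
The steps of Algorithm 1 can be sketched in NumPy as follows. This is a sketch under assumptions: the RBM uses the fixed-variance parametrization with visible bias `b`, hidden bias `c` and weights `W`, the VAE decoder is passed in as a hypothetical callable `decode`, and the encoding `z` of the partial observation is taken as given. Since the visible conditional is Gaussian, its MAP estimate is simply its mean.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_mdps(z, W, b, c, sigma, decode, n_samples=8, seed=0):
    """Snap an encoding z onto RBM cluster means and decode them into MDP samples."""
    rng = np.random.default_rng(seed)
    # Step 1: sample binary hidden states from P(h | z).
    p_h = sigmoid(c + (z / sigma) @ W)                  # shape (n_hidden,)
    h = (rng.random((n_samples, p_h.size)) < p_h).astype(float)
    # Step 2: MAP estimate of the embedding given h (the Gaussian mean).
    z_map = b + sigma * (h @ W.T)                       # shape (n_samples, dim_z)
    # Step 3: decode each MAP embedding into an MDP sample.
    return [decode(z_hat) for z_hat in z_map]

# Usage with two clusters at (-2, 2) and (2, -2) and an identity "decoder".
W = np.array([[4.0], [-4.0]])
b = np.array([-2.0, 2.0])
c = np.zeros(1)
samples = sample_mdps(np.array([0.3, -0.1]), W, b, c, 1.0, decode=lambda zz: zz, n_samples=5)
```

Every returned sample sits exactly on one of the cluster means, so an ambiguous encoding is never decoded into a mixture of worlds.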

4.3 Value function

Given samples from the model posterior, we perform action selection using an aggregate value function over the samples. We define, for each state $s$, an aggregate value function $\bar{V}(s)$ that combines the per-sample value functions $V_m(s)$

where $m$ is a sampled MDP and $V_m(s)$ denotes the value function for state $s$ under MDP $m$. $V_m$ can be obtained using any standard planning algorithm; we use value iteration (with $\gamma = 0.95$, 40 iterations). Action selection is done using an $\epsilon$-greedy mechanism. Since recomputing value functions at each step is computationally infeasible, each selected action persists for a fixed number of steps.
We note that the value functions used need not be exact, but can be approximate, as they are only used for a few steps. A quicker estimate can be obtained using Monte-Carlo methods when the state-space is large.
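
As a rough illustration, the sketch below runs value iteration with a discount of 0.95 for 40 iterations on each sampled MDP and aggregates the results. The plain mean used for aggregation, the shared transition tensor `P`, and the per-state reward vectors are simplifying assumptions made for this example only.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, n_iters=40):
    """P[a, s, s2]: transition probabilities; R[s]: reward collected in state s."""
    V = np.zeros(R.shape[0])
    for _ in range(n_iters):
        V = R + gamma * np.max(P @ V, axis=0)  # Bellman backup over actions
    return V

def aggregate_value(P, sampled_rewards, gamma=0.95, n_iters=40):
    """Assumed aggregate: plain mean of the per-sample value functions."""
    values = [value_iteration(P, R, gamma, n_iters) for R in sampled_rewards]
    return np.mean(values, axis=0)

def epsilon_greedy(P, V_bar, s, epsilon, rng):
    """Random action with probability epsilon, else the one-step greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(P.shape[0]))
    return int(np.argmax(P[:, s, :] @ V_bar))

# Usage: a 3-state corridor, action 0 moves left and action 1 moves right.
P = np.zeros((2, 3, 3))
P[0, 0, 0] = P[0, 1, 0] = P[0, 2, 1] = 1.0
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0
goal_right = np.array([0.0, 0.0, 1.0])
goal_left = np.array([1.0, 0.0, 0.0])
V_right = value_iteration(P, goal_right)
V_bar = aggregate_value(P, [goal_right, goal_left])
action = epsilon_greedy(P, V_right, 1, 0.0, np.random.default_rng(0))
```

When the two goal hypotheses are equally represented among the samples, the aggregate is symmetric and gives no preference to either end of the corridor, which is the situation an exploration signal is meant to resolve.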

5 Jacobian Exploration Bonus

To incentivize the agent to visit decisive pixels/locations, we introduce a bonus based on the change in the embedding $z$. Intuitively, the embedding has the highest change when the VAE detects changes that are relevant to the distribution that it is modelling. The bonus for a state is computed from the Jacobian of the embedding with respect to the list of observations made at that state. We use a transfer function to bound the activations produced by the Jacobian, thus maintaining numerical stability. This bonus can be used in two ways: as a pseudo-reward added to the actual reward deduced by the agent, or as a replacement for the actual reward.

While both methods showed improvement, the latter worked better since the total reward for states which already gave a high reward was not further increased. Since the bonus changes drastically with new observations, it is recomputed every time the value function is to be recomputed. The bonus is also memory-less, i.e. it doesn't carry over any information from one episode to the next.
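
A minimal sketch of such a bonus is given below. It is not the implementation used in our experiments: the Jacobian is approximated with finite differences rather than backpropagation, the transfer function is assumed to be `tanh` applied to the norm of each pixel's Jacobian column, and `encode` is a hypothetical stand-in for the VAE encoder mean.

```python
import numpy as np

def jacobian_bonus(encode, x, eps=1e-4):
    """Per-location bonus: tanh of the norm of d(embedding)/d(pixel)."""
    flat = np.asarray(x, dtype=float).ravel()
    z0 = np.asarray(encode(flat.reshape(x.shape)), dtype=float)
    bonus = np.zeros(flat.size)
    for i in range(flat.size):
        bumped = flat.copy()
        bumped[i] += eps                      # perturb one pixel
        z1 = np.asarray(encode(bumped.reshape(x.shape)), dtype=float)
        dz = (z1 - z0) / eps                  # finite-difference Jacobian column
        bonus[i] = np.tanh(np.linalg.norm(dz))  # bounded in [0, 1)
    return bonus.reshape(x.shape)

# Usage: a toy encoder whose embedding depends only on pixel (0, 1).
toy_encode = lambda img: np.array([3.0 * img[0, 1], 0.0])
bonus = jacobian_bonus(toy_encode, np.zeros((2, 2)))
```

In this toy case only location (0, 1) receives a non-zero bonus, mirroring how a marker pixel that determines the embedding would stand out from pixels that carry no information about the MDP.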

Figure 2: Final Jacobian Bonus for BW-E and BW-H - Locations in yellow-green are identified by the agent as being most helpful in deducing the MDP being solved.

6 Experiments

6.1 Testbench

We have implemented the following algorithms.

  • Value Iteration, referred to as STRL

  • Multi-task RL with VAE, RBM without exploration bonus, referred to as MTRL-0

  • Multi-task RL with VAE, RBM, and Jacobian Bonus, referred to as MTRL-

We have tested the above algorithms on 2 environments.

  • Back World (Easy) [BW-E] - Goal location alternates depending on marker location color, marker location is fixed and is in most paths from start to goal. This domain demonstrates the advantage gained using a probabilistic model over the MDPs.

  • Back World (Hard) [BW-H] - Same setting as BW-E, but the marker location is not on most paths from start to goal. This domain demonstrates the advantage provided by the Jacobian exploration bonus and our generative model.

For STRL, using only the visible portions of the environment was very unstable, and hence we had to add a pseudo-reward. For each unseen location, we provide a pseudo-reward that is annealed by a fixed factor at each step. Each episode was terminated at 200 steps if the agent hadn't reached the goal. Using this pseudo-reward, the agent was forcefully terminated fewer times. These worlds become challenging due to partial visibility. We use a 5x5 visibility kernel with clipped corners, and the agent is always assumed to be at its center. At each step, the environment tracks the locations that the agent has seen and presents them to the agent before an action is taken.
For our experiments, we consider the average number of steps to goal as a measure of loss and average reward as a measure of performance.
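
The visibility bookkeeping described above can be sketched as follows. The sketch assumes that "clipped corners" means only the four corner cells of the 5x5 kernel are hidden; the helper names `clipped_kernel` and `update_mask` are hypothetical.

```python
import numpy as np

def clipped_kernel(size=5):
    """Square visibility kernel with its four corner cells removed (assumed shape)."""
    k = np.ones((size, size), dtype=bool)
    k[0, 0] = k[0, -1] = k[-1, 0] = k[-1, -1] = False
    return k

def update_mask(mask, pos, kernel):
    """Mark every in-bounds cell covered by the kernel centred at pos as seen."""
    r = kernel.shape[0] // 2
    rows, cols = mask.shape
    for dr in range(-r, r + 1):
        for dc in range(-r, r + 1):
            i, j = pos[0] + dr, pos[1] + dc
            if kernel[dr + r, dc + r] and 0 <= i < rows and 0 <= j < cols:
                mask[i, j] = 1
    return mask

# Usage on a 28x28 world: one step in the middle, one step in a corner.
kernel = clipped_kernel()
centre_mask = update_mask(np.zeros((28, 28), dtype=int), (14, 14), kernel)
corner_mask = update_mask(np.zeros((28, 28), dtype=int), (0, 0), kernel)
```

The mask accumulates across steps within an episode, so it can be handed to the agent together with the image of the locations seen so far.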

Figure 3: 28x28 worlds used in our experiments, with panels (a) BW-E A, (b) BW-E B, (c) BW-H A, (d) BW-H B, (e) Suboptimal path [BW-H], (f) Optimal path [BW-H] and (g) Visibility Kernel. White indicates the start position of the agent. Green and yellow are marker locations. Red locations are failures. Blue locations are all successes. Gray areas in the kernel are visible to the agent. The white cell in the kernel is the agent's position. The optimal path shown considers MDP deduction as a sub-problem.

7 Results

7.1 Navigation

Table 1 gives the average reward for each agent and Table 2 gives the average episode length. Episodes are forcefully terminated at 200 steps if they have not yet completed. From the results, we infer the following.

  • STRL using value iteration does poorly as it has no way of deducing MDPs.

  • MTRL-0 solves both the BW-E and BW-H environments and does almost as well as MTRL-. This improvement over STRL can be attributed to the use of our deep generative model.

  • MTRL- shows better results on BW-H. This was expected, as MTRL-0 makes no attempt to visit marker locations. MTRL- is motivated by the Jacobian Bonus to visit marker locations, thereby deducing the MDP.

  • MTRL-0 performs as well as MTRL- in BW-E, as marker locations lie on most paths to the goal. However, it fails to understand the significance of the marker locations, and since markers in BW-H are not on most paths to the goal, this results in longer episodes and lower reward.

Table 1: Average Reward
Environment   STRL     MTRL-0   MTRL-
BW-E          0.21     0.99     0.99
BW-H          0.23     0.92     0.99

Table 2: Average Episode Length
Environment   STRL     MTRL-0   MTRL-
BW-E          184.19   46.20    46.29
BW-H          183.64   54.0     45.8

7.2 Visualizations

7.2.1 RBM Training

To visualize RBM training, we used the BW-E environment. We used a random agent (an $\epsilon$-greedy policy with $\epsilon = 1$, with actions restricted at boundaries so that it doesn't bump into walls) to navigate this environment and collect a sample of the seen environment at the end of each episode. We then encoded the samples using the VAE and used them to train an RBM. Figure 4 shows the clusters and the mean of each Gaussian fit by the RBM. Since there are only two possible BW-E environments, the RBM fits only two Gaussians and our encoded samples are clustered around them. Training was done in minibatches of 64 samples with 1 hidden unit in the RBM.

Figure 4: Visualization of RBM training on encoded BW-E world samples, with panels (a) to (h) showing epochs 0, 3, 6, 9, 12, 15, 26 and 79. Red points are the means of the fitted Gaussians. Blue points are data points in the minibatch. Black points are samples from the Gaussians fitted by the RBM. The spread of the black points is a measure of the variance of the fitted Gaussians.

We see two distinct clusters in each snapshot. RBM iteratively refines its parameters to fit the means close to the encoded sample clusters. Due to perfect reconstruction from VAE, there is no spread of encoded samples.

7.2.2 VAE Training

To visualize the training of VAE, we use the same setup as for visualizing RBM training. We used 400 samples from BW-E and training was done in mini-batches of 128 samples. 4 test samples were randomly chosen and reconstruction from VAE was recorded after 30, 60, 90 and 120 epochs of training. Figure 5 shows the training progress of VAE on BW-E samples.

Figure 5: Visualization of VAE training on BW-E, with panels (a) Input, (b) Epoch 30, (c) Epoch 60, (d) Epoch 90 and (e) Epoch 120. After 30 epochs, the VAE has learnt the general structure of the grid-world, but has yet to learn colors. After 60 epochs, it has learnt colors for marker pixels that were in most training examples, but has yet to learn goal state colors. After 90 epochs, it has learnt colors for the goal state, but has yet to learn colors at corners, as they are present in very few samples. After 120 epochs, learning is complete.

8 Conclusions and Future Work

We have presented a new method using a deep generative model to provide an exploration bonus for solving the multi-task reinforcement learning problem. Our modification to the VAE loss function allows it to learn from partial inputs and to learn the associations between different views of the same environment. Using RBMs to learn a distribution over the VAE embeddings allows us to sample actual MDPs instead of a mixture of MDPs. We introduced an intuitive exploration bonus and have shown improvements over existing baselines.
One drawback of Jacobian Bonus is that it doesn’t use the reward structure of the MDPs. This bonus could yield sub-optimal policies in environments with multiple markers and associated rewards. We would like to incorporate the reward structure into the Jacobian Bonus to have some form of utility interpretation.
Our deep generative model is scalable, and we would like to explore learning in larger worlds and extend our method to work with Minecraft-like 3D environments.
