Exploration for Multitask Reinforcement Learning with Deep Generative Models
Abstract
Exploration in multitask reinforcement learning is critical in training agents to deduce the underlying MDP. Many existing exploration frameworks, such as Thompson sampling, assume a single stationary MDP and are not suitable for system identification in the multitask setting. We present a novel method to facilitate exploration in multitask reinforcement learning using deep generative models. We supplement our method with a low-dimensional energy model to learn the underlying MDP distribution and provide a resilient and adaptive exploration signal to the agent. We evaluate our method on a new set of environments and provide an intuitive interpretation of our results.
1 Introduction
Learning to solve multiple tasks simultaneously is the multitask reinforcement learning (MTRL) problem. MTRL can be solved either by planning after deducing the current MDP, or by ignoring MDP deduction and learning a policy over all the MDPs combined. For example, suppose a marker in the environment determines the obstacle structure and the goal locations. Any agent will have to visit the marker to learn the environment structure; hence, deducing the task to be solved becomes an important part of the agent's policy.
MTRL is different from Transfer learning, which aims to generalize (or transfer) knowledge from a set of tasks to solve a new, but similar, task. MTRL also involves learning a common representation for a set of tasks, but makes no attempt to generalize to new tasks.
Conventional reinforcement learning algorithms such as Q-learning and SARSA fail to identify this decision-making subproblem. They also learn suboptimal policies because the reward structure is nonstationary: the goal location varies from episode to episode. To address these issues, current research in MTRL either uses a model over all possible MDPs to deduce the current MDP, or uses some form of memory to store past observations and then reasons over this history. These methods can deduce the MDP if they happen to see the markers, but they make no effort to actively search for them.
Our methods incentivize the agent to actively seek markers and deduce the MDP by providing a smooth, adaptive exploration signal, as a pseudoreward, obtained from a generative model using deep neural networks. Since our method relies on computing the Jacobian of some state representation with respect to the input, we use deep neural networks to learn this state representation. To the best of our knowledge, we are the first to propose using the Jacobian as an exploration bonus.
Our exploration bonus is added to the intrinsic reward, as in Bayesian Exploration Bonus. We focus on gridworlds that use colors as markers. For clarity, we redefine a state s to be a single grid location, and take its pixel value to be the observation. Our methods can, however, generalize to arbitrarily complex distributions on the observations X. We also assume that the agent can deduce rewards and transition probabilities from the observation.
2 Related Work
There is extensive research in the field of exploration strategies for reducing uncertainty in the MDP. Bayesian Exploration Bonus assigns a pseudo reward to states, calculated from the frequency of state visitation. Thompson sampling samples an MDP from the posterior distribution computed using evidence (rewards and transition probabilities) that it obtains from trajectories. We follow a similar approach to sample MDPs. Recent advances in this domain involve sampling using Dropout Neural Networks (gal2015dropout). However, these algorithms assume that a single stationary MDP underlies every episode (a setting we refer to as Single-Task RL, or STRL). Our algorithm addresses the Multitask RL problem, where each episode uses an MDP sampled from an arbitrary distribution on MDPs. In contrast to the STRL exploration strategies, our exploration bonus is designed to mark states that are potentially useful for the agent to improve its certainty about the current MDP.
Recent MTRL and Transfer Learning algorithms such as Value Iteration Networks (tamar2016value) and Actor-Mimic Networks (parisotto2015actor) attempt to identify common structure among tasks and generalize learning to new tasks with similar structure. In the context of MTRL and Transfer Learning on environments that give image-like observations, Value Iteration Networks employ Recurrent Neural Networks for value iteration and learn kernel functions to estimate the reward and transition probabilities for a state from its immediate surroundings. This has the effect of easily generalizing to new tasks which share the same MDP structure (the reward and transition probabilities for a state can be determined using locality assumptions). Our work does not attempt to learn common structure across MDPs for the purpose of transfer learning. Instead, we attempt to learn the input MDP distribution to deduce the current MDP given the observations.
oh2016control propose a novel method using Deep Recurrent Memory Networks to learn policies on Minecraft multitask environments. They use a fixed memory of past observations. At each step, a context vector is generated and the memory network is queried for relevant information. This model successfully learns policies on I-shaped environments where the color of a marker cell determines the goal location. In their experiments with the I-world and Pattern-Recognition world, the identifier states are very close to the agent's starting position.
Another class of MTRL algorithms focuses on deducing the current MDP using Bayesian reasoning. Multiclass models proposed by lazaric2010bayesian and wilson2007multi attempt to assign class labels to the current MDP given a sequence of observations made from it. wilson2007multi use a Hierarchical Bayesian Model (HBM) to learn a conditional distribution over class labels given the observations. The agent samples an MDP from the posterior distribution in a manner similar to Thompson sampling, and then chooses the action. We follow the same procedure for action selection, but incorporate exploration bonuses into it as well.
lazaric2010bayesian propose Multiclass Multitask Learning (MCMTL), a nonparametric Hierarchical Bayesian Model that learns inherent structure in the value functions of each class. MCMTL clusters MDPs into classes and learns a posterior distribution over the MDPs given observed evidence. This is similar to our work, but it does not explicitly incentivize the agent to visit marker states.
Our contributions are twofold. First, we propose a deep generative model that allows sampling from the posterior distribution. Second, we propose a novel exploration bonus that uses the model's posterior distribution.
3 Background
3.1 Variational Auto Encoders
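Variational Auto Encoders (VAEs) (kingma2013auto) attempt to learn the distribution $P(X)$ that generated the data $X$. VAEs, like standard autoencoders, have an encoder component and a decoder component, connected through a latent code $z$. Generative models that attempt to estimate $P(X)$ use a likelihood objective function, $P(X)$ or $\log P(X)$. More formally, the objective function can be written as
$$P(X) = \int P(X \mid z)\, P(z)\, dz$$
where $P(z)$ is the prior over latent codes and $P(X \mid z)$ is the likelihood of the data given a code, as modelled by the decoder.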
Gradient-based learning requires approximating the integral with samples of $z$. In a high-dimensional space, this could lead to large estimation errors, as $P(X \mid z)$ is likely to be concentrated around a few select values of $z$, and it would take an infeasible number of samples to get a proper estimate. VAEs circumvent this problem by introducing a new distribution $Q(z \mid X)$ to sample $z$ from. To reduce parameters, we use a Gaussian $Q(z \mid X) = \mathcal{N}(z;\, \mu(X),\, \Sigma(X))$. The two functions $\mu$ and $\Sigma$ are approximated with a deep network and form the encoder component of the VAE. $P(X \mid z)$ is represented by a second deep network, evaluated on samples $z \sim Q(z \mid X)$, and forms the decoder component of the VAE.
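After some mathematical sleight of hand to account for $Q(z \mid X)$ in the learning equations (walker2016uncertain provides an intuitive understanding of these equations), we obtain the following formulation of the loss function
$$\mathcal{L}(X) = -\mathbb{E}_{z \sim Q(z \mid X)}\left[\log P(X \mid z)\right] + D_{KL}\left(Q(z \mid X)\,\|\,P(z)\right)$$
where the KL-divergence term exists to adjust for importance sampling from $Q(z \mid X)$ instead of $P(z)$.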
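As a rough illustration of a loss of this shape, the sketch below evaluates a single-sample estimate for a diagonal-Gaussian encoder and a standard normal prior. The squared-error reconstruction term, the `log_var` parameterisation and every function name here are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math


def kl_diag_gaussian_to_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def reconstruction_error(x, x_hat):
    """Sum of squared errors; stands in for -log P(X | z) up to constants
    under an assumed unit-variance Gaussian decoder."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction term plus the KL term that pulls Q(z | X) towards P(z)."""
    return reconstruction_error(x, x_hat) + kl_diag_gaussian_to_std_normal(
        mu, log_var
    )
```

The KL term vanishes exactly when the encoder outputs the prior itself (zero mean, unit variance), which is a quick sanity check on any implementation of this loss.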
3.2 Gaussian-Binary Restricted Boltzmann Machines
RBMs have been used widely to learn energy models over an input distribution. An RBM is an undirected, complete bipartite, probabilistic graphical model with hidden units $h$ and visible units $v$. In Gaussian-Binary RBMs, the hidden units are binary (Bernoulli-distributed), so $n$ hidden units can represent a total of $2^n$ combinations, while the visible units are Gaussian. The network is parametrized by a matrix $W$ of edge weights between the nodes of $v$ and $h$, and by bias vectors for $v$ and $h$. Given a visible state $v$, the hidden state $h$ is obtained by sampling the posterior $P(h \mid v)$. Given a hidden state $h$, the visible state $v$ is obtained by sampling the posterior $P(v \mid h)$.
Since RBMs model conditional distributions, the conditionals ($P(h \mid v)$ and $P(v \mid h)$) have a closed form, while the marginal and joint distributions ($P(v)$, $P(h)$ and $P(v, h)$) cannot be computed without explicit summation over all combinations.
Parameters are learnt using contrastive divergence (hinton2010practical). Learning the variance of the visible units, however, proved to be unstable (hinton2010practical); hence, we treat it as a hyperparameter and keep it fixed.
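The two conditional sampling steps can be sketched as follows. This uses one common parameterisation of a Gaussian-Binary RBM with a shared, fixed standard deviation `sigma`; the exact form used in the paper is not recoverable from the text, so the formulas, the default `sigma=1.0` and all names below are assumptions.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def hidden_probs(v, W, b_hid, sigma=1.0):
    """P(h_j = 1 | v); W[i][j] connects visible unit i to hidden unit j."""
    return [
        sigmoid(b_hid[j] + sum(v[i] / sigma * W[i][j] for i in range(len(v))))
        for j in range(len(b_hid))
    ]


def sample_hidden(v, W, b_hid, rng, sigma=1.0):
    """Draw each binary hidden unit from its Bernoulli posterior."""
    return [1 if rng.random() < p else 0 for p in hidden_probs(v, W, b_hid, sigma)]


def visible_mean(h, W, b_vis, sigma=1.0):
    """Mean of the Gaussian P(v | h)."""
    return [
        b_vis[i] + sigma * sum(W[i][j] * h[j] for j in range(len(h)))
        for i in range(len(b_vis))
    ]


def sample_visible(h, W, b_vis, rng, sigma=1.0):
    """Draw each visible unit from a Gaussian with fixed standard deviation."""
    return [rng.gauss(m, sigma) for m in visible_mean(h, W, b_vis, sigma)]
```

Under this parameterisation, an RBM with a single hidden unit has exactly two possible visible means, `b_vis` (for h = 0) and `b_vis + sigma * W` (for h = 1), so it behaves like a pair of fixed-variance Gaussians.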
4 Deep Generative Model
4.1 Encoding
Let us consider the nature of our inputs. We have assumed that the agent's observed surroundings are embedded on a map as an image $X$. A mask $M$ is a binary image of the same dimensions, whose entry is 1 if the corresponding state has been observed by the agent and 0 otherwise. We denote the $i$-th pixel and its mask value by $x_i$ and $m_i$ respectively.
In most episodes, the agent will not visit the entire gridworld, hence $m_i = 0$ for some $i$. Since there can be several views of the same ground-truth MDP, we need to be able to reconstruct it from multiple observations over several episodes. For Single-Task RL, this can be done in a tabular fashion. In MTRL, however, there are potentially infinitely many MDPs, and it becomes hard to build associations between different views of the same MDP.
We use deep convolutional VAEs to infer the association between different views of the same MDP, and use them with a low-dimensional energy model to sample MDPs given observational evidence. Here, the locality of the world features warrants the use of convolution layers in the VAE. Figure 1 shows our setup to learn the associations and to infer the ground-truth MDP given observations. We use one setup to train the model and another to allow back-sampling of MDPs. We call these the train and query models.
Our method can be scaled to large state spaces because of the VAE.
Given this setup, for the learning phase, we modify the VAE loss function to account for unobserved states, i.e. those whose mask value is 0. In the new loss function, the reconstruction penalty for each pixel is weighted by its mask value.
Including the mask in the loss function is quite intuitive and works well on the sets that we tested on, since it removes any penalty for unseen pixels and allows the VAE to project its knowledge onto the unseen states.
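One plausible reading of this masked loss is sketched below: each pixel's squared reconstruction error is multiplied by its binary mask value, so unseen pixels contribute nothing. The squared-error form, the unweighted sum with the KL term, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
def masked_reconstruction_error(x, x_hat, mask):
    """Squared error summed over observed pixels only (mask value 1)."""
    return sum(m * (a - b) ** 2 for a, b, m in zip(x, x_hat, mask))


def masked_vae_loss(x, x_hat, mask, kl_term):
    """Masked reconstruction error plus a precomputed KL regulariser."""
    return masked_reconstruction_error(x, x_hat, mask) + kl_term


# The middle pixel is unseen, so its (wrong) reconstruction is not penalised.
example_loss = masked_vae_loss(
    x=[1.0, 0.0, 1.0], x_hat=[1.0, 1.0, 0.0], mask=[1, 0, 1], kl_term=0.0
)
```

Because the decoder is never penalised on unseen pixels, it is free to fill them in with whatever it has learnt from other views of the same environment.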
4.2 Sampling
Given a partial observation, we sample from the model posterior to obtain MDP samples. If the observation doesn't contain enough evidence to skew the posterior in favour of a single MDP, then the encoding $z$ produced by the VAE is far from the encodings of ground-truth MDPs in the embedding space. If we sample from this posterior, we obtain an MDP that is a mixture of MDPs. Solving this MDP could result in the agent following a policy unsuitable for any of the component MDPs in isolation.
One way to circumvent this problem is to train a probability distribution over the MDP embeddings $z$. For our 2-MDP environments, we use a Gaussian-Binary RBM to cluster inputs with fixed-variance Gaussians. We then use Algorithm 1 to sample from these Gaussians.
4.3 Value function
Given samples from the model posterior, we perform action selection using an aggregate value function over the samples. For each state $s$, the aggregate value function combines the value functions $V_m(s)$ of the sampled MDPs, where $m$ is an MDP and $V_m(s)$ denotes the value function for state $s$ under MDP $m$. $V_m$ can be obtained using any standard planning algorithm; we use value iteration (with $\gamma = 0.95$, 40 iterations). Action selection is done using an $\epsilon$-greedy mechanism. Since recomputing value functions at each step is computationally infeasible, each selected action persists for a fixed number of steps.
We note that the value functions need not be exact; approximations suffice, as they are only used for a few steps. A quicker estimate can be obtained using Monte Carlo methods when the state space is large.
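The planning and action-selection loop can be sketched on a toy problem. The example below runs value iteration (discount 0.95, 40 sweeps, as quoted above) on a hypothetical one-dimensional corridor, averages the value functions of the sampled MDPs, and picks an action epsilon-greedily. The corridor environment, the use of a plain mean as the aggregate, and all names are assumptions for illustration only.

```python
import random

GAMMA = 0.95        # discount factor quoted in the text
N_SWEEPS = 40       # value-iteration sweeps quoted in the text
ACTIONS = (-1, 1)   # hypothetical 1-D corridor: move left or right


def move(state, action, n_states):
    """Deterministic transition, clipped at the corridor ends."""
    return min(max(state + action, 0), n_states - 1)


def value_iteration(rewards, goal):
    """Synchronous value iteration; the goal state is terminal."""
    n = len(rewards)
    values = [0.0] * n
    for _ in range(N_SWEEPS):
        values = [
            rewards[s] if s == goal
            else rewards[s] + GAMMA * max(values[move(s, a, n)] for a in ACTIONS)
            for s in range(n)
        ]
    return values


def aggregate_values(per_mdp_values):
    """Average V_m(s) over the sampled MDPs m (the mean is an assumption)."""
    return [sum(col) / len(per_mdp_values) for col in zip(*per_mdp_values)]


def epsilon_greedy(state, agg_values, epsilon, rng):
    """Random action with probability epsilon, else the best next state."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    n = len(agg_values)
    return max(ACTIONS, key=lambda a: agg_values[move(state, a, n)])


# Three posterior samples: two place the goal on the right, one on the left.
goal_right = value_iteration([0.0, 0.0, 0.0, 0.0, 1.0], goal=4)
goal_left = value_iteration([1.0, 0.0, 0.0, 0.0, 0.0], goal=0)
aggregate = aggregate_values([goal_right, goal_right, goal_left])
action = epsilon_greedy(2, aggregate, epsilon=0.0, rng=random.Random(0))
```

With two of the three samples agreeing, the aggregate leans towards the right-hand goal; with an even split the two directions would tie from the middle state, which is the kind of ambiguity that an exploration bonus is meant to resolve.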
5 Jacobian Exploration Bonus
To incentivize the agent to visit decisive pixels/locations, we introduce a bonus based on the change in the embedding $z$. Intuitively, the embedding changes the most when the VAE detects changes that are relevant to the distribution that it is modelling. The bonus for a state $s$ is computed from the Jacobian of the embedding with respect to the list of observations made at $s$. We use a transfer function to bound the activations produced by the Jacobian, thus maintaining numerical stability. This bonus can be used in two ways: as a pseudo reward added to the actual reward deduced by the agent, or as a replacement for the actual reward. While both methods showed improvement, the latter worked better, since the total reward for states which already gave a high reward was not further increased. Since the bonus changes drastically with new observations, it is recomputed every time the value function is recomputed. The bonus is also memoryless, i.e. it doesn't carry over any information from one episode to the next.
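A minimal sketch of a Jacobian-based bonus is given below, assuming a hypothetical encoder `toy_encode` in place of the trained VAE. The finite-difference Jacobian, the column norm, the `tanh` transfer function and the `max`-based variant of the second combination rule are all assumptions; the paper's exact formulas are not recoverable from the text, and a real implementation would more likely use automatic differentiation.

```python
import math


def numerical_jacobian(encode, x, eps=1e-4):
    """J[k][i] approximates d z_k / d x_i using central differences."""
    z_dim = len(encode(x))
    jac = [[0.0] * len(x) for _ in range(z_dim)]
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        z_hi, z_lo = encode(hi), encode(lo)
        for k in range(z_dim):
            jac[k][i] = (z_hi[k] - z_lo[k]) / (2.0 * eps)
    return jac


def jacobian_bonus(encode, x):
    """Per-pixel bonus: bounded norm of that pixel's Jacobian column."""
    jac = numerical_jacobian(encode, x)
    return [
        math.tanh(math.sqrt(sum(row[i] ** 2 for row in jac)))
        for i in range(len(x))
    ]


def additive_reward(reward, bonus):
    """Bonus used as a pseudo reward on top of the actual reward."""
    return reward + bonus


def capped_reward(reward, bonus):
    """One possible reading of the second variant: never exceed the larger."""
    return max(reward, bonus)


def toy_encode(x):
    """Hypothetical encoder whose embedding depends mostly on pixel 1."""
    return [3.0 * x[1], 0.1 * x[0] + 0.1 * x[2]]


bonus = jacobian_bonus(toy_encode, [0.0, 0.0, 0.0])
```

In this toy setting the bonus is largest at pixel 1, the one that moves the embedding the most, which mirrors the intent of steering the agent towards marker locations.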
6 Experiments
6.1 Testbench
We have implemented the following algorithms.

Value Iteration, referred to as STRL

Multitask RL with VAE, RBM without exploration bonus, referred to as MTRL0

Multitask RL with VAE, RBM, and Jacobian Bonus, referred to as MTRL
We have tested the above algorithms on 2 environments.

Back World (Easy) [BWE]: The goal location alternates depending on the color of the marker location. The marker location is fixed and lies on most paths from start to goal. This domain demonstrates the advantage gained by using a probabilistic model over the MDPs.

Back World (Hard) [BWH]: Same setting as BWE, but the marker location is not on most paths from start to goal. This domain demonstrates the advantage provided by the Jacobian exploration bonus and our generative model.
For STRL, using only the visible portions of the environment was very unstable, and hence we had to add a pseudo reward. For each unseen location, we provide a pseudo reward that is annealed at each step. Each episode was terminated at 200 steps if the agent hadn’t reached the goal. With this pseudo reward, the agent was forcefully terminated fewer times.
These worlds become challenging due to partial visibility. We use a 5x5 kernel with clipped corners, and the agent is always assumed to be at the center. At each step, the environment tracks the locations that the agent has seen and presents them to the agent before an action is taken.
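The visibility bookkeeping can be sketched as follows, assuming that "clipped corners" means the four corner cells of the 5x5 window are excluded (leaving 21 visible cells). The grid representation and function names are hypothetical.

```python
def kernel_offsets(size=5):
    """Offsets of a size x size window around the agent, minus its 4 corners."""
    half = size // 2
    return [
        (dr, dc)
        for dr in range(-half, half + 1)
        for dc in range(-half, half + 1)
        if not (abs(dr) == half and abs(dc) == half)
    ]


def update_seen(seen, agent_pos):
    """Mark every in-bounds cell under the kernel as seen (1); edits in place."""
    n_rows, n_cols = len(seen), len(seen[0])
    row, col = agent_pos
    for dr, dc in kernel_offsets():
        r, c = row + dr, col + dc
        if 0 <= r < n_rows and 0 <= c < n_cols:
            seen[r][c] = 1
    return seen


# A 7x7 world with the agent standing in the middle.
seen_mask = update_seen([[0] * 7 for _ in range(7)], (3, 3))
```

The resulting binary grid plays the role of the mask over observed states: it accumulates across steps within an episode and would be reset when a new episode starts.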
For our experiments, we consider the average number of steps to goal as a measure of loss and average reward as a measure of performance.
7 Results
7.1 Navigation
Table 1 gives the average reward for each agent. Table 2 gives the average episode length. We also impose forceful termination at 200 steps if the episode has not yet completed. From the results, we infer the following.

STRL using value iteration does poorly as it has no way of deducing MDPs.

MTRL0 solves both the BWE and BWH environments and does almost as well as MTRL. This improvement over STRL can be attributed to the use of our deep generative model.

MTRL shows better results on BWH. This was expected as MTRL0 makes no attempt to visit marker locations. MTRL is motivated by the Jacobian Bonus to visit marker locations, thereby deducing the MDP.

MTRL0 performs as well as MTRL in BWE, as marker locations lie on most paths to the goal. However, since it fails to understand the significance of the marker locations, and the markers in BWH are not on most paths to the goal, it produces longer episodes and lower reward in BWH.
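Table 1: Average reward.
World | STRL | MTRL0 | MTRL
BWE | 0.21 | 0.99 | 0.99
BWH | 0.23 | 0.92 | 0.99

Table 2: Average episode length (steps).
World | STRL | MTRL0 | MTRL
BWE | 184.19 | 46.20 | 46.29
BWH | 183.64 | 54.0 | 45.8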
7.2 Visualizations
7.2.1 RBM Training
To visualize RBM training, we used the BWE environment. We used a random agent (an $\epsilon$-greedy policy with $\epsilon = 1$, with actions restricted at boundaries so that it doesn’t bump into walls) to navigate this environment and collect a sample of the seen environment at the end of each episode. We then encoded the samples using the VAE and used the encodings to train an RBM. Figure 4 shows the clusters and the mean of each Gaussian fit by the RBM. Since there are only two possible BWE environments, the RBM fits only two Gaussians, and our encoded samples are clustered around them. Training was done in minibatches of 64 samples with 1 hidden unit in the RBM.
We see two distinct clusters in each snapshot. The RBM iteratively refines its parameters to fit the means close to the encoded sample clusters. Due to perfect reconstruction from the VAE, there is no spread of encoded samples.
7.2.2 VAE Training
To visualize the training of VAE, we use the same setup as for visualizing RBM training. We used 400 samples from BWE and training was done in minibatches of 128 samples. 4 test samples were randomly chosen and reconstruction from VAE was recorded after 30, 60, 90 and 120 epochs of training. Figure 5 shows the training progress of VAE on BWE samples.
8 Conclusions and Future Work
We have presented a new method that uses a deep generative model to provide an exploration bonus for solving the multitask reinforcement learning problem. Our modification to the VAE loss function allows it to learn from partial inputs, and also to learn the associations between different views of the same environment. Using RBMs to learn a distribution over the MDP embeddings allows us to sample actual MDPs instead of a mixture of MDPs. We introduced an intuitive exploration bonus and have shown improvements over existing baselines.
One drawback of Jacobian Bonus is that it doesn’t use the reward structure of the MDPs. This bonus could yield suboptimal policies in environments with multiple markers and associated rewards. We would like to incorporate the reward structure into the Jacobian Bonus to have some form of utility interpretation.
Our deep generative model is scalable, and we would like to explore learning in larger worlds and extend our method to work with Minecraft-like 3D environments.