Implicit Generative Modeling for Efficient Exploration
Abstract
Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where rewards from the environment are sparse. A commonly used approach for exploring such environments is to introduce some “intrinsic” reward. In this work, we focus on model uncertainty estimation as an intrinsic reward for efficient exploration. In particular, we introduce an implicit generative modeling approach to estimate a Bayesian uncertainty of the agent’s belief of the environment dynamics. Each random draw from our generative model is a neural network that instantiates the dynamic function; multiple draws therefore approximate the posterior, and the variance in the future prediction based on this posterior is used as an intrinsic reward for exploration. We design a training algorithm for our generative model based on amortized Stein Variational Gradient Descent. In experiments, we compare our implementation with state-of-the-art intrinsic reward-based exploration approaches, including two recent approaches based on an ensemble of dynamic models. In challenging exploration tasks, our implicit generative model consistently outperforms competing approaches regarding data efficiency in exploration.
1 Introduction
Reinforcement learning (RL) has enjoyed recent success in a variety of applications, including superhuman performance in Atari games (mnih2013atari), robotic control (lillicrap2015continuous), image-based control tasks (hafner2019planet), and playing the game of Go (silver2016go). Despite these achievements, many recent RL techniques still suffer from poor sample efficiency. Agents are often trained for millions, or even billions, of simulation steps before achieving reasonable performance (burda2018large). This lack of statistical efficiency makes it difficult to apply RL to real-world tasks, as the cost of acting in the real world is far greater than in a simulator. It is therefore of utmost importance to design agents that make efficient use of collected data. According to (sutton2018reinforcement), there are three key aspects in building a data-efficient agent for reinforcement learning: generalization, exploration, and long-term consequence awareness. In this work, we focus on the efficient exploration aspect. In particular, we focus on challenging environments with sparse external rewards, where exploration is commonly driven by some sort of intrinsic reward. It has been observed (osband2017value; osband2018prior) that a Bayesian uncertainty¹ estimate plays an important role in efficient exploration in deep RL, but it is unfortunately not appropriately addressed in the majority of state-of-the-art RL algorithms.
¹ As noted in (osband2018prior), it is the (epistemic) uncertainty in the agent’s belief, rather than the (aleatoric) uncertainty of the outcome, which reflects the inherent randomness of the environment, that matters the most for efficient exploration in RL.
In this work, we introduce a new framework of Bayesian uncertainty modeling for intrinsic reward-based exploration in RL. Our framework characterizes the (epistemic) uncertainty in the agent’s belief of the environment dynamics in a nonparametric way to enable flexibility and expressiveness. The main component of our framework is a network generator, each draw of which is a neural network that serves as the dynamic function for RL. Multiple draws then approximate a posterior of the dynamic model, and the variance in future state prediction based on this posterior is used as an intrinsic reward for exploration. Recently, it has been shown (ratzlaff2019hypergan) that such generators can be trained for classification problems, and that the resulting draws can represent a rich distribution of networks that perform approximately equally well on the classification task. For our goal of training this generator for the dynamic function, we propose an algorithm to minimize the KL divergence between the implicit distribution (represented by draws from the generator) and the true posterior of the dynamic model (given the agent’s experience) via amortized Stein Variational Gradient Descent (SVGD) (liu2016svgd; feng2017asvgd).
Compared with recent works (pathak2019disagreement; shyam2019max) that maintain an ensemble of dynamic models and use the divergence or disagreement among them as an intrinsic reward for exploration, our implicit modeling of the posterior has several advantages. Firstly, it is a more flexible framework for approximating the model posterior than ensemble-based approximation, where the number of particles is fixed. Secondly, it is based on the principle of amortized SVGD (feng2017asvgd), where the KL divergence between the implicit posterior and the true posterior is directly minimized in a nonparametric sense, and further projected to a finite-dimensional parameter update. This is in contrast with existing ensemble-based methods, which rely on random initialization and/or bootstrapped experience sampling for the ensemble to approximate the posterior. Thirdly, it is more memory efficient, given that our method stores and updates only the parameters of the generator rather than the parameters of every member network of the ensemble.
In our experiments, we compare our approach with several state-of-the-art intrinsic reward-based exploration approaches, including two recent approaches that also leverage the uncertainty in dynamic models. In all the tasks we have tested, our implementation consistently outperforms competing methods regarding data efficiency in exploration.
In summary, our contributions are:

We propose a new framework for implicitly approximating the posterior of network parameters where the uncertainty of the network function can be used as an intrinsic reward for efficient exploration in RL.

We design an amortized SVGD-based training algorithm for the proposed framework and apply it to approximate the implicit distribution of the dynamic model of the environment.

We test our implementation on three challenging exploration tasks and compare with three state-of-the-art intrinsic reward-based methods, two of which are also based on uncertainty in dynamic models. The consistently superior performance of our method demonstrates the effectiveness of the proposed framework in estimating Bayesian uncertainty in the dynamic model for efficient exploration.
2 Problem Setup and Background
Consider a Markov Decision Process (MDP) represented as $(\mathcal{S}, \mathcal{A}, P, r, \rho_0)$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. $P$ is the unknown dynamics model, specifying the probability of transitioning to the next state $s'$ from the current state $s$ by taking action $a$, as $P(s' \mid s, a)$. $r$ is the reward function, and $\rho_0$ is the distribution of initial states. A policy is a function $\pi$, which outputs a distribution over the action space for a given state $s$.
2.1 Exploration in Reinforcement Learning
In online decision-making problems, such as multi-armed bandits and reinforcement learning, a fundamental dilemma in an agent’s choice is exploitation versus exploration. Exploitation refers to making the best decision given current information, while exploration refers to gathering more information about the environment. In the standard reinforcement learning setting where the agent receives an external reward for each transition step, common recipes for the exploration/exploitation trade-off include naive methods such as ε-greedy (sutton2018reinforcement) and optimistic initialization (lai1985asymptotically), and posterior-guided methods such as upper confidence bounds (auer2002ucb; dani2008ucb) and Thompson sampling (thompson1933likelihood). In the situation we focus on, where external rewards are sparse or disregarded, the above trade-off narrows down to the pure exploration problem of efficiently accumulating information about the environment. The common approach is to explore in a task-agnostic way under some “intrinsic” reward. An exploration policy can then be trained in the standard RL way where dense rewards are available. Existing methods construct intrinsic rewards from the visitation frequency of the state (bellemare2016unifying), the prediction error of the dynamic model as “curiosity” (pathak2017curiosity), the diversity of visited states (eysenbach2018diversity), etc.
2.2 Dynamic Model Uncertainty as Intrinsic Reward
Following the guiding principle of modeling Bayesian uncertainty in online decision making, two recent methods (pathak2019disagreement; shyam2019max) train an ensemble of dynamic models and use the variation/information gain as an intrinsic reward for exploration. In this work, we follow the similar idea of exploiting the uncertainty in the dynamic model, but emphasize implicit posterior modeling in contrast with directly training an ensemble of dynamic models.
Let $f$ denote a model of the environment dynamics (usually represented by a neural network) that we want to learn from the agent's experience $\mathcal{D}$. We design a generator module $G$ which takes a random draw $z$ from the normal distribution and outputs a sample vector of parameters $\theta$ that determines $f$ (denoted as $f_\theta$). If samples from $G$ represent the posterior distribution $p(f_\theta \mid \mathcal{D})$, then given $(s_t, a_t)$, the uncertainty in the output of the dynamics model can be computed as the following variance among a set of samples $\{\theta_i\}_{i=1}^{m}$ from $G$, and used as an intrinsic reward for learning an exploration policy,
$$r^{in}(s_t, a_t) = \frac{1}{m} \sum_{i=1}^{m} \Big\| f_{\theta_i}(s_t, a_t) - \frac{1}{m} \sum_{j=1}^{m} f_{\theta_j}(s_t, a_t) \Big\|^2. \tag{1}$$
In learning the exploration policy, this intrinsic reward can be computed with either actual rollouts in the environment or simulated rollouts generated by the estimated dynamic model.
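As a rough illustration of the variance-based intrinsic reward (1), the following sketch computes the reward from a stack of next-state predictions. The function name `intrinsic_reward` and the array layout (sampled models on the first axis) are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def intrinsic_reward(predictions: np.ndarray) -> np.ndarray:
    """Variance of next-state predictions across sampled dynamics models.

    predictions has shape (m, batch, state_dim): the outputs of m sampled
    models on the same batch of (state, action) pairs (assumed layout).
    Returns one reward per batch element, shape (batch,).
    """
    mean_prediction = predictions.mean(axis=0, keepdims=True)
    # Squared distance of each model's prediction from the ensemble mean.
    squared_deviation = ((predictions - mean_prediction) ** 2).sum(axis=-1)
    # Average over the m sampled models.
    return squared_deviation.mean(axis=0)

# Two models that agree on the first transition and disagree on the second.
example = np.array([
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [2.0, 0.0]],
])
rewards = intrinsic_reward(example)
```

Transitions on which the sampled models agree receive zero reward, so the exploration policy is drawn toward regions where the models still disagree.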
3 Posterior Approximation via Amortized SVGD
In this section, we introduce the core component of our exploration agent, the dynamic model generator $G$. In the following subsections, we first introduce the design of this generator and then describe its training algorithm in detail. A summary of our algorithm is given in the last subsection.
3.1 Implicit Posterior Generator
As shown in Fig. 1, the dynamic model is defined as a $K$-layer neural network function $f_\theta(s, a)$, with a (state, action) pair as input and model parameters $\theta = (\theta_1, \dots, \theta_K)$, where $\theta_k$ represents the network parameters of the $k$-th layer. The generator module $G$ consists of exactly $K$ layer-wise generators, $\{G_1, \dots, G_K\}$, where each $G_k$ takes a random noise vector $z_k$ and outputs the corresponding parameter vector $\theta_k = G_k(z_k; \eta_k)$, where $\eta_k$ are the parameters of $G_k$. Note that the $z_k$'s are generated independently from a $d$-dimensional standard normal distribution, rather than jointly.
As mentioned in §1, this framework has advantages in flexibility and efficiency compared with ensemble-based methods (shyam2019max; pathak2019disagreement), since it maintains only the parameters of the $K$ generators, i.e., $\eta = (\eta_1, \dots, \eta_K)$, and enables drawing an arbitrary number of sample networks to approximate the posterior of the dynamic model.
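The layer-wise generator design can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the helper names, the small one-hidden-layer generators, the bias-free layers, and all sizes are hypothetical and chosen only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer_generator(noise_dim, hidden_dim, out_dim):
    """Parameters of one layer-wise generator (a small ReLU MLP, assumed)."""
    return {
        "W1": rng.normal(0.0, 0.1, (noise_dim, hidden_dim)),
        "W2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)),
    }

def generate_weights(eta, z, fan_in, fan_out):
    """Map one noise vector to the weight matrix of one dynamics-model layer."""
    hidden = np.maximum(z @ eta["W1"], 0.0)
    return (hidden @ eta["W2"]).reshape(fan_in, fan_out)

def sample_dynamics_model(generators, layer_sizes, noise_dim):
    """Draw one dynamics network; each layer gets its own independent noise."""
    shapes = zip(layer_sizes[:-1], layer_sizes[1:])
    return [
        generate_weights(eta, rng.standard_normal(noise_dim), fan_in, fan_out)
        for eta, (fan_in, fan_out) in zip(generators, shapes)
    ]

def predict_next_state(weights, state, action):
    """Forward pass of a sampled network on a (state, action) pair."""
    x = np.concatenate([state, action])
    for k, W in enumerate(weights):
        x = x @ W
        if k < len(weights) - 1:
            x = np.tanh(x)
    return x

# Hypothetical sizes: 3-D state, 1-D action, two hidden layers of 16 units.
NOISE_DIM, LAYER_SIZES = 8, [4, 16, 16, 3]
generators = [
    init_layer_generator(NOISE_DIM, 32, fan_in * fan_out)
    for fan_in, fan_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])
]
state, action = np.ones(3), np.zeros(1)
model_a = sample_dynamics_model(generators, LAYER_SIZES, NOISE_DIM)
model_b = sample_dynamics_model(generators, LAYER_SIZES, NOISE_DIM)
pred_a = predict_next_state(model_a, state, action)
pred_b = predict_next_state(model_b, state, action)
```

Only the generator parameters are stored; every call to `sample_dynamics_model` yields a fresh network, so the number of sampled models can be changed without changing memory use for parameters.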
3.2 Training with Amortized Stein Variational Gradient Descent
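We now introduce the training algorithm of the generator module $G$. Let $p(f \mid \mathcal{D})$ denote the true posterior of the dynamic model given the agent's experience $\mathcal{D}$, and let $q(f)$ denote the implicit distribution of $f$ captured by $G$. We want $q$ to be as close as possible to $p$; such closeness is commonly measured by the KL divergence $D_{KL}(q \,\|\, p)$. The traditional approach for finding a $q$ that minimizes $D_{KL}(q \,\|\, p)$ is variational inference, where an evidence lower bound (ELBO) is maximized. Recently, a nonparametric variational inference framework, Stein Variational Gradient Descent (SVGD) (liu2016svgd), was proposed, which represents $q$ with a set of particles rather than making any parametric assumptions, and approximates functional gradient descent w.r.t. $q$ by iteratively evolving the particles. We apply SVGD to our sampled network functions, and follow the idea of amortized SVGD (feng2017asvgd) to project the functional gradients to the parameter space of $G$ by backpropagation through the generators.
Given a set of dynamic functions $\{f_i\}_{i=1}^{m}$ sampled from $G$, SVGD updates each function by
$$f_i \leftarrow f_i + \epsilon\, \phi^*(f_i), \quad i = 1, \dots, m,$$
where $\epsilon$ is a step size, and $\phi^*$ is the function in the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ that maximally decreases the KL divergence between the distribution $q$ represented by $\{f_i\}_{i=1}^{m}$ and the target posterior $p$,
$$\phi^* = \arg\max_{\phi \in \mathcal{H},\, \|\phi\|_{\mathcal{H}} \le 1} \Big\{ -\frac{d}{d\epsilon} D_{KL}\big(q_{[\epsilon\phi]} \,\|\, p\big) \Big|_{\epsilon = 0} \Big\}.$$
This optimization problem has a closed form solution,
$$\phi^*(f_i) = \mathbb{E}_{f \sim q}\big[ \nabla_f \log p(f)\, k(f, f_i) + \nabla_f k(f, f_i) \big],$$
where $k(\cdot, \cdot)$ is the positive definite kernel associated with the RKHS. We use a Gaussian kernel for our implementation. The log-likelihood term $\log p(f)$ corresponds to the negation of the regression loss $\mathcal{L}$ of future state prediction for all transitions in $\mathcal{D}$, i.e., $\log p(f) = -\sum_{(s, a, s') \in \mathcal{D}} \mathcal{L}(f(s, a), s')$. Given that each $f_i$ is determined by $\theta_i$, the corresponding SVGD update rule for each sampled $\theta_i$ is
$$\theta_i \leftarrow \theta_i + \epsilon\, \phi^*(\theta_i),$$
where
$$\phi^*(\theta_i) = \mathbb{E}_{\theta \sim G}\big[ \nabla_\theta \log p(f_\theta)\, k(f_\theta, f_{\theta_i}) + \nabla_\theta k(f_\theta, f_{\theta_i}) \big]. \tag{2}$$
Given that the $\theta_i$'s are generated by $G$ with parameters $\eta$ from noise draws $z_i$, the update rule for $\eta$ can be obtained by the chain rule,
$$\eta \leftarrow \eta + \epsilon \sum_{i=1}^{m} \nabla_\eta G(z_i; \eta)\, \phi^*(\theta_i), \tag{3}$$
where $\phi^*(\theta_i)$ can be computed by (2) using the empirical expectation over the sampled batch $\{\theta_j\}_{j=1}^{m}$,
$$\phi^*(\theta_i) = \frac{1}{m} \sum_{j=1}^{m} \big[ \nabla_{\theta_j} \log p(f_{\theta_j})\, k(f_{\theta_j}, f_{\theta_i}) + \nabla_{\theta_j} k(f_{\theta_j}, f_{\theta_i}) \big]. \tag{4}$$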
Algorithm 1 shows our procedure in pseudocode. Starting with a buffer of random transitions, our algorithm samples a set of dynamic models from the generator, and updates the generator parameters using amortized SVGD (3) and (4). For the policy update, the intrinsic reward (1) is evaluated on either the actual experience or the simulated experience generated by the sampled dynamic models. The exploration policy is then updated using a model-free RL algorithm on the collected experience and intrinsic rewards. The updated exploration policy is then rolled out in the environment for a fixed number of steps so that new transitions are collected and added to the buffer for subsequent iterations.
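To make the amortized SVGD update more concrete, here is a small self-contained sketch on a toy problem. It is not the procedure of Algorithm 1: the target is a 2-D Gaussian standing in for the dynamics-model posterior, the generator is a hypothetical linear map $\theta = Az + b$ so that the chain rule can be written by hand, the kernel acts directly on parameter vectors, and the bandwidth uses a common median heuristic. All names and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the posterior: an isotropic Gaussian over 2-D "parameters".
TARGET_MEAN = np.array([2.0, -1.0])
TARGET_STD = 0.5

def grad_log_p(theta):
    """Gradient of the toy log-density at each particle, shape (m, dim)."""
    return -(theta - TARGET_MEAN) / TARGET_STD**2

def rbf_kernel(theta):
    """Gaussian kernel matrix and its gradient w.r.t. the first argument."""
    diffs = theta[:, None, :] - theta[None, :, :]  # diffs[j, i] = theta_j - theta_i
    sq_dists = (diffs**2).sum(axis=-1)
    # Median heuristic for the bandwidth (an assumed, commonly used choice).
    h = np.median(sq_dists) / np.log(len(theta) + 1.0) + 1e-8
    K = np.exp(-sq_dists / h)                      # K[j, i] = k(theta_j, theta_i)
    grad_K = (-2.0 / h) * diffs * K[:, :, None]    # d k(theta_j, theta_i) / d theta_j
    return K, grad_K

def svgd_direction(theta):
    """Empirical SVGD direction for every particle (driving + repulsive term)."""
    K, grad_K = rbf_kernel(theta)
    return (K.T @ grad_log_p(theta) + grad_K.sum(axis=0)) / len(theta)

def train_linear_generator(steps=2000, m=32, noise_dim=2, lr=0.1):
    """Amortized SVGD for a hypothetical linear generator theta = A z + b."""
    A, b = 0.1 * np.eye(2, noise_dim), np.zeros(2)
    for _ in range(steps):
        z = rng.standard_normal((m, noise_dim))
        theta = z @ A.T + b
        phi = svgd_direction(theta)
        # Chain rule through the generator: the update for A sums outer(phi_i, z_i).
        # The batch average (instead of a sum) only rescales the step size.
        A += lr * (phi.T @ z) / m
        b += lr * phi.mean(axis=0)
    return A, b

A, b = train_linear_generator()
samples = rng.standard_normal((2000, 2)) @ A.T + b
```

In the setting described in the text, the generator is a neural network and the gradient through it would be obtained by backpropagation; the hand-written chain rule here is only possible because the toy generator is linear.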
3.3 Summary of the Exploration Algorithm
To condense what we have proposed so far, we summarize in Algorithm 1 the procedure used to train the generator of dynamic models and the exploration policies. We repeat the process, with the agent acting in the environment under the exploration policy and collecting new experience.
4 Experiments
In this section, we conduct experiments to compare our approach to the existing state of the art in efficient exploration with intrinsic reward. For our purpose, only the task-agnostic setting is considered, where the agent explores the environment irrespective of the downstream task. Task-agnostic exploration is essential when external rewards are sparse and there is large uncertainty in the environment.
4.1 Toy Task: NChain
As a sanity check, we first follow MAX (shyam2019max) to evaluate our method on a stochastic version of the toy environment NChain. As shown in Figure 2, the chain is a finite sequence of states. Each episode starts from state and lasts for steps. For each step, the agent can move forward to the next state in the chain or backward to the previous state. Attempting to move off the edge of the chain results in the agent staying still. Reward is only afforded to the agent at the edge states: for reaching state , and for reaching state . In addition, there is uncertainty built into the environment: each state is designated as a flip-state with probability . When acting from a flip-state, the agent’s actions are reversed, i.e., moving forward will result in movement backward, and vice versa. Given the (initially) random dynamics and a sufficiently long chain, we expect an agent using an ε-greedy exploration strategy to exploit only the small reward of state . In contrast, agents with exploration policies which actively reduce uncertainty can efficiently discover every state in the chain. Figure 3 shows that our agent navigates the chain in fewer than 15 episodes, while the ε-greedy agent (double DQN) indeed does not make meaningful progress.
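The chain dynamics described above can be sketched as a small environment class. The class name `StochasticChain` and every numeric default below (chain length, flip probability, reward values, start state) are placeholder assumptions for illustration; the values used in the experiments are not reproduced here.

```python
import numpy as np

class StochasticChain:
    """Chain of states with randomly assigned flip-states (illustrative values)."""

    def __init__(self, n_states=40, flip_prob=0.5, small_reward=0.01,
                 large_reward=1.0, start_state=1, seed=0):
        rng = np.random.default_rng(seed)
        self.n_states = n_states
        self.small_reward = small_reward
        self.large_reward = large_reward
        self.start_state = start_state
        # Each state is independently designated a flip-state.
        self.is_flip = rng.random(n_states) < flip_prob
        self.state = start_state

    def reset(self):
        self.state = self.start_state
        return self.state

    def step(self, action):
        """action is +1 (forward) or -1 (backward); flip-states reverse it."""
        move = -action if self.is_flip[self.state] else action
        # Moving off either edge leaves the agent where it is.
        self.state = int(np.clip(self.state + move, 0, self.n_states - 1))
        if self.state == 0:
            reward = self.small_reward
        elif self.state == self.n_states - 1:
            reward = self.large_reward
        else:
            reward = 0.0
        return self.state, reward
```

Because the flip-states are fixed at construction time but unknown to the agent, the effect of each action in each state has to be discovered through interaction, which is the uncertainty the intrinsic reward is meant to reduce.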
4.2 Continuous Control Environments
We also consider three challenging continuous control tasks in which efficient exploration is known to be difficult. In each environment, the dynamics are nonlinear and cannot be solved with simpler (efficient) tabular approaches. As stated above, external reward is completely removed; the agent is motivated purely by the uncertainty in its belief of the environment.
Experimental setup
To validate the effectiveness of our method, we compare with several state-of-the-art formulations of intrinsic reward. Specifically, we conduct experiments comparing the following methods:

(Ours) The proposed intrinsic reward, using the estimated variance of an implicit distribution of the dynamic model.

(ICM) Error between predicted next state and observed next state (pathak2017curiosity).

(Disagreement) Variance of predictions from an ensemble of dynamic models (pathak2019disagreement).

(MAX) Jensen-Rényi information gain of the dynamic function (shyam2019max).

(Random) Pure random exploration as a naive baseline.
Implementation details
Given that our goal is to compare the performance across different intrinsic rewards, we fix the model architecture, training pipeline, and hyperparameters across all methods.² The dynamic models are 4-layer fully connected neural networks. For the purpose of computing the information gain, dynamic models for MAX predict both the mean and variance of the next state, while for other methods, dynamic models predict only the mean. Our generator as well as the dynamic models for other methods are optimized using Adam (kingma2014adam) with a learning rate of . To learn exploration policies, we use the Soft Actor Critic (haarnoja2018soft) algorithm for all methods. For MAX, ICM, and Disagreement, we use ensembles of 32 dynamic models each to compute the intrinsic reward. Since our method trains a generator of dynamic models instead, we fix the number of models we sample from the generator to the same ensemble size for a fair comparison. Further implementation details can be found in the supplementary material.
² We use the codebase of MAX as a basis and implement Ours, ICM, and Disagreement intrinsic rewards under the same framework.
4.2.1 Acrobot Control
Our first environment is a modified continuous control version of the Acrobot. As shown in Figure 4(a), the Acrobot environment begins with a downward-hanging pendulum which consists of two links connected by an actuated joint. Normally, a discrete action either applies a unit force on the joint in the left or right direction or applies no force. We modify the environment such that a continuous action applies a force in the corresponding direction.
To focus on efficient exploration, we test the ability of each exploration method to sweep the entire lower hemisphere: positioning the acrobot completely horizontal in both the left and right directions. Given that this is a relatively simple task that can be solved by random exploration, as shown in Figure 4(b), all four intrinsic reward methods solve it within just hundreds of steps, and our method is the most efficient one. The takeaway here is that in relatively simple environments, where there might be little room for improvement over the state of the art, our method still achieves better performance due to its flexibility and efficiency in approximating the model posterior. We will see in subsequent experiments that this advantage carries over to more difficult environments.
4.2.2 Ant Maze Navigation
Next, we evaluate on the Ant Maze environment (Figure 5(a)). In this control task, the agent provides torques to each of the 8 joints of the ant. The provided observation contains the pose of the torso as well as the angles and velocities of each joint. The agent’s performance is measured by the percentage of the U-shaped maze explored during evaluation. Figure 5(b) shows the result of each method over 5 seeds. Our agent consistently navigates to the end of the maze by the time other methods have explored only 60% or less. We show how state visitation frequencies progress through training in Figures 5(c)-5(f). While MAX (shyam2019max) also navigates the maze, the more advanced uncertainty modeling scheme of our method allows our agent to better estimate state novelty, which leads to considerably quicker exploration.
4.2.3 Robotic Manipulation
The final task is an exploration task in a robotic manipulation environment, HandManipulateBlock. As shown in Figure 6(a), a robotic hand is given a palm-sized block for manipulation. The agent has actuation control of the 20 joints that make up the hand, and its exploration performance is measured by the percentage of possible rotations of the cube that the agent performs.³ In particular, the state of the cube is represented by Cartesian coordinates along with a quaternion to represent the rotation. We transform the quaternion to Euler angles and discretize the resulting state space by degree intervals. The agent is evaluated based on how many of the 512 total states are visited.
³ This is different from the original goal of this environment since we want to evaluate task-agnostic exploration rather than goal-based policies.
This task is far more challenging than the previous tasks, having larger state and action spaces. Additionally, states are more difficult to reach than in the Ant Maze environment, requiring manipulation of 20 joints instead of 8. In order to explore in this environment, an agent must also learn how to rotate the block without dropping it. Figure 6(b) shows the performance of each method over 5 seeds. This environment proved very challenging for all methods; none succeeded in exploring more than half of the state space. When placed in a complicated environment where the task is not clear, we want our agents to explore as fast as possible in order to master the dynamics of the environment. For this environment, we can see that our method indeed performs the best by a clear margin regarding exploration efficiency.
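The rotation discretization can be illustrated with a short sketch. The bin count is an assumption: 8 bins per Euler angle (45-degree intervals) is used here because $8^3 = 512$ matches the stated number of states. The quaternion ordering $(w, x, y, z)$, the roll-pitch-yaw convention, and the function names are likewise assumptions.

```python
import numpy as np

def quat_to_euler(q):
    """Unit quaternion (w, x, y, z) to roll, pitch, yaw in radians."""
    w, x, y, z = q
    roll = np.arctan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    pitch = np.arcsin(np.clip(2.0 * (w * y - z * x), -1.0, 1.0))
    yaw = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return np.array([roll, pitch, yaw])

def rotation_bin(q, bins_per_axis=8):
    """Index of the discretized rotation state, in [0, bins_per_axis**3)."""
    angles = np.degrees(quat_to_euler(q)) % 360.0
    width = 360.0 / bins_per_axis
    idx = np.minimum((angles // width).astype(int), bins_per_axis - 1)
    return int(np.ravel_multi_index(tuple(idx), (bins_per_axis,) * 3))

def coverage(visited_bins, bins_per_axis=8):
    """Fraction of discretized rotation states visited at least once."""
    return len(set(visited_bins)) / bins_per_axis**3

identity = np.array([1.0, 0.0, 0.0, 0.0])
yaw_100 = np.array([np.cos(np.radians(50.0)), 0.0, 0.0, np.sin(np.radians(50.0))])
roll_neg10 = np.array([np.cos(np.radians(-5.0)), np.sin(np.radians(-5.0)), 0.0, 0.0])
visited = [rotation_bin(q) for q in (identity, yaw_100, roll_neg10, identity)]
```

With this particular Euler convention the pitch angle only spans half of the circle, so not every bin is reachable; the convention used in the experiments is not specified in the text.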
5 Related Work
Efficient exploration remains a major challenge in deep reinforcement learning (fortunato2017noisy; burda2018exploration; eysenbach2018diversity; burda2018large), and there is no consensus on the correct way to explore an environment. One practical guiding principle for efficient exploration is the reduction of the agent’s epistemic uncertainty of the environment (chaloner1995bayesian; osband2017value). osband2016bootdqn uses a bootstrap ensemble of DQNs, where the predictions of the ensemble are used as an estimate of the agent’s uncertainty over the value function. osband2018prior proposed to augment the predictions of a DQN agent by adding the contribution from a prior to the value estimate. In contrast to our method, these approaches seek to estimate the uncertainty in the value function, while we focus on exploration with intrinsic reward by estimating the uncertainty of the dynamic model. fortunato2017noisy add parameterized noise to the agent’s weights to induce state-dependent exploration beyond ε-greedy or an entropy bonus.
Methods for constructing intrinsic rewards for exploration have become the subject of increased study. One well-known approach is to use the prediction error of a learned dynamics model as an intrinsic reward (pathak2017curiosity; schmidhuber1991curious). schmidhuber1991curious and sun2011planning proposed using the learning progress of the agent as an intrinsic reward. Count-based methods (bellemare2016unifying; ostrovski2017count) give a reward that decreases with the visitation count of a state. HouVIME formulate exploration as a variational inference problem, and use Bayesian neural networks (BNN) to maintain the agent’s belief over the transition dynamics. The BNN predictions are used to estimate a form of Bayesian information gain called compression improvement. The variational approach is also explored in mohamed2015variational; gregor2016variational; salge2014empowerment, who proposed using intrinsic rewards based on a variational lower bound on empowerment, the mutual information between an action and the induced next state. This reward is used to learn a set of discriminative low-level skills. The works most closely related to ours are two recent methods (pathak2019disagreement; shyam2019max) that compute intrinsic rewards from an ensemble of dynamic models. Disagreement among the ensemble members in next-state predictions is computed as an intrinsic reward. shyam2019max also uses active exploration (schmidhuber2003exploring; chua2018deep), in which the agent is trained in a surrogate MDP to maximize intrinsic reward before acting in the real environment. Our method follows the similar idea of exploiting the uncertainty in the dynamic model, but instead suggests an implicit generative modeling of the posterior of the dynamic function, which enables a more flexible approximation of the posterior uncertainty with better sample efficiency.
There has been a wealth of research on nonparametric particle-based variational inference methods (liu2016svgd; dai2016provable; ambrogioni2018wasserstein), where a certain number of particles are maintained to represent the variational distribution, and updated by solving an optimization problem. Notably, we make use of amortized SVGD (feng2017asvgd) to optimize our generator for approximately sampling from the posterior of the dynamic model.
6 Conclusion
In this work, we introduced a novel method for representing the agent’s uncertainty of the environment dynamics. We formulated an intrinsic reward based on the uncertainty given by an approximate posterior of the dynamic model to enable efficient exploration in difficult environments. Through experiments in control, navigation, and manipulation, we demonstrated that our method is consistently more sample efficient than the baseline methods. Future work includes investigating the efficacy of learning an approximate posterior of the agent’s value or policy model, as well as more efficient sampling techniques.
References
Appendix A Appendix
A.1 Implementation Details for Continuous Environments
Here we describe in more detail the various implementation choices we used for our method as well as for the baselines.
Toy Chain Environment
The chain environment was implemented based on the NChain-v0 gym environment. We altered NChain-v0 to contain 40 states instead of 10 to reduce the possibility of solving the environment with random actions. We also modified the stochastic “slipping” state behavior by fixing the behavior of each state with respect to reversing an action. For both our method and MAX, we use ensembles of 5 deterministic neural networks with 4 layers, each 256 units wide with tanh nonlinearities. As usual, our ensembles are sampled from the generator at each timestep, while MAX uses a static ensemble. We generate each layer in the target network with generators composed of two hidden layers, 64 units each with ReLU nonlinearities. Both models are trained by minimizing the regression loss on the observed data. We optimize using Adam with a learning rate of , and weight decay of . We use Monte Carlo Tree Search (MCTS) to find exploration policies for use in the environment. We build the tree with 25 iterations of 10 random trajectories, and UCB1 as the selection criterion. Crucially, during rollouts, we query the dynamic models instead of the simulator, and we compute the corresponding intrinsic reward. For MAX we use the Jensen-Shannon divergence, while our method uses the variance in the predictions of our samples. There is a small discrepancy between our numbers and those reported in the MAX paper for the chain environment. This is due to using UCB1 as the selection criterion instead of Thompson sampling as used in the MAX paper. We take actions in the environment based on the child with the highest value. The tree is then discarded after one step, after which the dynamic models are fit for 10 additional epochs.
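For readers unfamiliar with UCB1, the selection rule can be sketched as below. This is the textbook UCB1 formula applied to the children of a tree node; the function name, the argument layout, and the choice to visit unvisited children first are assumptions, and the sketch does not reproduce the MCTS implementation used in the experiments.

```python
import math

def ucb1_select(total_values, visit_counts, c=math.sqrt(2.0)):
    """Return the index of the child maximizing mean value plus a UCB1 bonus.

    total_values[i] is the sum of returns backed up through child i and
    visit_counts[i] is how often it was visited (assumed bookkeeping).
    """
    total_visits = sum(visit_counts)
    best_index, best_score = 0, -math.inf
    for index, (value, count) in enumerate(zip(total_values, visit_counts)):
        if count == 0:
            # Unvisited children are tried before any bonus is compared.
            return index
        score = value / count + c * math.sqrt(math.log(total_visits) / count)
        if score > best_score:
            best_index, best_score = index, score
    return best_index
```

With equal mean values, the bonus term favors the child with fewer visits, which is what drives the tree search to spread its rollouts.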
Continuous Control Environments
For each method, we kept the implementation details consistent across the environments Acrobot, Ant Maze, and Block Manipulation. The common details of each exploration method are as follows. Each method uses (or samples) ensembles to approximate environment dynamics. Ensembles are composed of 32 networks with 4 hidden layers, 512 units wide with ReLU nonlinearities, except for MAX, which uses swish.⁴ ICM, Disagreement, and our method use ensembles of deterministic models, while MAX uses probabilistic networks which output a Gaussian distribution over next states. The approximate dynamic models (ensembles/generators) are optimized with Adam, using a minibatch size of 256, a learning rate of , and weight decay of .
⁴ Swish refers to the nonlinearity proposed by (ramachandran2017searching), which is expressed as a scaled sigmoid function: $\mathrm{swish}(x) = x \cdot \sigma(x)$.
For our dynamic model, each layer generator is composed of two hidden layers, 64 units wide, with ReLU nonlinearities. The output dimensionality of each generator is equal to the product of the input and output dimensionality of the corresponding layer in the dynamic model. To sample one dynamic model, each generator takes as input an independent draw $z \sim \mathcal{N}(0, I_{32})$, where $I_{32}$ is the 32-dimensional identity matrix. We sample ensembles of arbitrary size by instead providing a batch of draws as input. To train the generator such that we can sample accurate transition models, we update its parameters according to Equation 4: we compute the regression error on the data, as well as the repulsive term using an appropriate kernel. For all experiments we used a standard Gaussian kernel whose bandwidth is the median of the pairwise distances between sampled particles. Because we sample functions instead of data points, the pairwise distance is computed using the likelihood of the data under the model.
For MAX, we used the code provided by (shyam2019max). Each member of the approximate dynamic model ensemble is a probabilistic neural network that predicts a Gaussian distribution (with diagonal covariance) over next states. The exploration policies are trained with SAC, given an experience buffer of rollouts performed by the dynamic models and labeled with the intrinsic reward: the Jensen-Rényi divergence between next-state predictions of the dynamic models. The policy trained with SAC acts in the environment to maximize the intrinsic reward, and in doing so collects additional transitions that serve as training data for the dynamic models in the subsequent training phase.
For Disagreement (pathak2019disagreement), we follow the authors’ implementation, changing minimal details. The intrinsic reward is formulated as the predictive variance of the approximate dynamic model, where the model is given by a bootstrap ensemble. In this work, we report results using two versions of this method. The proposed intrinsic reward is formulated in a manner quite similar to our own; however, an ensemble is used instead of a distribution for the approximate posterior. In §4 we report results using only the intrinsic reward, instead of the full proposed method, which makes use of a differentiable reward function that treats the reward as a supervised loss. We do this because the form of the approximate dynamic model does not preclude the use of different policy optimization techniques. Nonetheless, in §A.2 we report results using the full method as proposed, on each continuous control experiment.
The Bayesian approach is extended in (HouVIME)
A.2 Additional Results
Here we report additional results comparing with Disagreement, including the original policy optimization method with a differentiable reward function used in (pathak2019disagreement). We repeat our main experiments, comparing our method both to Disagreement used purely as an intrinsic reward and to the full method using the differentiable reward function for policy optimization. For the full method, we use the authors’ official implementation in the following experiments. This is in contrast to the method reported in the main text, where we only implement the intrinsic reward. Figures 7, 8, and 9 show results on the Acrobot, Ant Maze, and Block Manipulation environments, respectively. In each figure, lines correspond to the mean of three seeds, and shaded regions denote one standard deviation. In each experiment, we can see that treating the intrinsic reward as a supervised loss (gray) improves on the baseline Disagreement intrinsic reward (green). However, our method (red) remains the most sample efficient in these experiments.