EMI: Exploration with Mutual Information Maximizing State and Action Embeddings
Abstract
Policy optimization struggles when the reward feedback signal is very sparse, essentially degenerating into random search until the agent accidentally stumbles upon a rewarding state or the goal state. Recent works utilize intrinsic motivation to guide the exploration via generative models, predictive forward models, or more ad hoc measures of surprise. We propose EMI, an exploration method that constructs embedding representations of states and actions that do not rely on generative decoding of the full observation but extract predictive signals that can be used to guide exploration based on forward prediction in the representation space. Our experiments show state-of-the-art performance on challenging locomotion tasks with continuous control and on image-based exploration tasks with discrete actions on Atari.
Seoul National University, Department of Computer Science and Engineering 
UC Berkeley, Department of Electrical Engineering and Computer Sciences 
{harry2636,jaekyeom,yeonwoo,hyunoh}@mllab.snu.ac.kr 
svlevine@eecs.berkeley.edu 
1 Introduction
The central task in reinforcement learning is to learn policies that maximize the total reward received from interacting with an unknown environment. Although recent methods have been shown to solve a range of complex tasks (Mnih et al., 2015; Schulman et al., 2015; 2017), their success hinges on whether the agent constantly receives intermediate reward feedback. In challenging environments with sparse reward signals, these methods struggle to obtain meaningful policies unless the agent luckily stumbles into the rewarding or predefined goal states.
To this end, prior works on exploration generally utilize some kind of intrinsic motivation mechanism to provide a measure of surprise. These measures can be based on density estimation via generative models (Bellemare et al., 2016; Fu et al., 2017; Oh et al., 2015), predictive forward models (Stadie et al., 2015; Houthooft et al., 2016), or more ad hoc measures that aim to approximate surprise (Pathak et al., 2017). Methods based on predictive forward models and generative models must model the distribution over state observations, which can make them difficult to scale to complex, high-dimensional observation spaces, while models that eschew direct forward prediction or density estimation rely on heuristic measures of surprise that may not transfer effectively to a wide range of tasks.
Our aim in this work is to devise a method for exploration that does not require direct generation of high-dimensional state observations, while still retaining the benefits of being able to measure surprise based on forward prediction. If exploration is performed by seeking out states that maximize surprise, the problem, in essence, is in measuring surprise, which requires a representation where functionally similar states are close together, and functionally distinct states are far apart.
In this paper, we propose to learn compact representations for both the states and actions simultaneously satisfying the following criteria: First, given the representations of state and the corresponding next state, the uncertainty of the representation of the corresponding action should be minimal. Second, given the representations of the state and the corresponding action, the uncertainty of the representation of the corresponding next state should also be minimal. Third, the action embedding representation should seamlessly support both the continuous and discrete actions. Finally, we impose the linear dynamics model in the representation space which can also explain the rare irreducible error under the dynamics model. Given the representation, we guide the exploration by measuring surprise based on forward prediction and relative increase in diversity in the embedding representation space. Figure 1 illustrates an example of our learned state and action embedding representations and the linearity of sample transitions in the representation space in Montezuma’s Revenge.
We present two main technical contributions that make this into a practical exploration method. First, we describe how compact state and action representations can be constructed via Donsker & Varadhan (1983) estimation of mutual information without relying on generative decoding of full observations. Second, we show that imposing linear topology on the learned embedding representation space (such that the transitions are linear), thereby offloading most of the modeling burden onto the embedding function itself, provides an essential informative measure of surprise when visiting novel states.
For the experiments, we show that we can use our representations on a range of complex image-based tasks and robotic locomotion tasks with continuous actions. We report state-of-the-art results compared to recent intrinsic motivation based exploration methods (Fu et al., 2017; Pathak et al., 2017) on several challenging Atari tasks and robotic locomotion tasks with sparse rewards.
2 Related works
Our work is related to the following strands of active research:
Unsupervised representation learning via mutual information estimation. Recent literature on unsupervised representation learning generally focuses on extracting latent representations that maximize an approximate lower bound on the mutual information between the code and the data. In the context of generative adversarial networks (Goodfellow et al., 2014), Chen et al. (2016) and Belghazi et al. (2018) aim at maximizing an approximation of the mutual information between the latent code and the raw data. Belghazi et al. (2018) estimate the mutual information with a neural network via Donsker & Varadhan (1983) estimation to learn a better generative model. Hjelm et al. (2018) build on the idea and train a decoder-free encoding representation maximizing the mutual information between the input image and the representation. Furthermore, the method uses the Nowozin et al. (2016) estimation of the Jensen-Shannon divergence rather than the KL divergence to estimate the mutual information for better numerical stability. Oord et al. (2018) estimate mutual information via an autoregressive model and make predictions on local patches in an image.
Exploration with intrinsic motivation. Prior works on exploration mostly employ intrinsic motivation to estimate a measure of novelty or surprisal to guide the exploration. Bellemare et al. (2016) utilize density estimation via the CTS (Bellemare et al., 2014) generative model and derive pseudo-counts as the intrinsic motivation. Fu et al. (2017) avoid building explicit density models by training K-exemplar models that distinguish a state from all other observed states. Some methods train predictive forward models (Stadie et al., 2015; Houthooft et al., 2016; Oh et al., 2015) and use the prediction error as the intrinsic motivation. Oh et al. (2015) employ generative decoding of the full observation via recursive autoencoders, which can be challenging to scale to high-dimensional observations. VIME (Houthooft et al., 2016) approximates the environment dynamics, uses the information gain of the learned dynamics model as intrinsic rewards, and showed encouraging results on robotic locomotion problems. However, the method needs to update the dynamics model for each observation and is unlikely to scale to complex tasks with high-dimensional states such as Atari games.
Other approaches utilize more ad hoc measures (Pathak et al., 2017; Tang et al., 2017) that aim to approximate surprise. ICM (Pathak et al., 2017) transforms the high-dimensional states to a feature space and imposes cross-entropy and Euclidean losses so that the action and the feature of the next state are predictable. However, unlike VIME, ICM does not utilize mutual information to directly measure the uncertainty and is limited to discrete actions. Our method (EMI) is also reminiscent of Kohonen & Somervuo (1998) in the sense that we seek to construct a decoder-free latent space from the high-dimensional observation data with a topology in the latent space. In contrast to the prior works on exploration, we seek to construct the representation under a linear topology; we do not require decoding the full observation but seek to encode the essential predictive signal that can be used for guiding the exploration.
3 Preliminaries
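We consider a Markov decision process defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{+}$ is the environment transition distribution, $r: \mathcal{S} \to \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. Let $\pi$ denote a stochastic policy over actions given states. Denote $P_0: \mathcal{S} \to \mathbb{R}_{+}$ as the distribution of the initial state $s_0$. The discounted sum of expected rewards under the policy $\pi$ is defined by
$$\eta(\pi) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \gamma^{t} r(s_t)\right] \tag{1}$$
where $\tau = (s_0, a_0, \ldots, a_{T-1}, s_T)$ denotes the trajectory, $s_0 \sim P_0(s_0)$, $a_t \sim \pi(a_t \mid s_t)$, and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$. The objective in policy based reinforcement learning is to search over the space of parameterized policies (e.g. neural networks) $\pi_\theta(a \mid s)$ in order to maximize $\eta(\pi_\theta)$.
Also, denote $P^{\pi}_{SAS'}$ as the joint probability distribution of singleton experience tuples $(s, a, s')$ starting from $s_0 \sim P_0(s_0)$ and following the policy $\pi$. Furthermore, define $P^{\pi}_{A}$ as the marginal distribution of actions, $P^{\pi}_{SS'}$ as the marginal distribution of states and the corresponding next states, $P^{\pi}_{S'}$ as the marginal distribution of the next states, and $P^{\pi}_{SA}$ as the marginal distribution of states and the actions following the policy $\pi$.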
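As a small illustration of the discounted sum inside Equation 1, the following sketch computes the discounted return of a single sampled trajectory. The function name and the example reward sequence are hypothetical; the expectation over trajectories would be approximated by averaging this quantity over many rollouts.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one trajectory."""
    total = 0.0
    # Accumulate from the last step backwards (Horner's rule).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical sparse-reward trajectory: a single reward at the final step.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(sparse_rewards, gamma=0.5))  # 0.125
```

With sparse rewards, most trajectories contribute a return of exactly zero, which is why the policy gradient carries little signal until a rewarding state is reached.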
4 Methods
Our goal is to construct the embedding representation of the observation and action (discrete or continuous) for complex dynamical systems that does not rely on generative decoding of the full observation, but still provides a useful predictive signal that can be used for exploration. This requires a representation where functionally similar states are close together, and functionally distinct states are far apart. We approach this objective from maximizing mutual information under several criteria.
4.1 Mutual information maximizing state and action embedding representations
We first introduce the embedding functions of states $\phi_\alpha: \mathcal{S} \to \mathbb{R}^d$ and actions $\psi_\beta: \mathcal{A} \to \mathbb{R}^d$ with parameters $\alpha$ and $\beta$ (i.e. neural networks), respectively. We seek to learn the embedding functions of states ($\phi_\alpha$) and actions ($\psi_\beta$) satisfying the following two criteria:
1. Given the embedding representation of states $\phi_\alpha(s)$ and the actions $\psi_\beta(a)$, the uncertainty of the embedding representation of the corresponding next states $\phi_\alpha(s')$ should be minimal and vice versa.
2. Given the embedding representation of states $\phi_\alpha(s)$ and the corresponding next states $\phi_\alpha(s')$, the uncertainty of the embedding representation of the corresponding actions $\psi_\beta(a)$ should also be minimal and vice versa.
Intuitively, the first criterion translates to maximizing the mutual information between $[\phi_\alpha(s); \psi_\beta(a)]$ and $\phi_\alpha(s')$, which we define as $\mathcal{I}_S(\alpha, \beta)$ in Equation 2. The second criterion translates to maximizing the mutual information between $[\phi_\alpha(s); \phi_\alpha(s')]$ and $\psi_\beta(a)$, defined as $\mathcal{I}_A(\alpha, \beta)$ in Equation 3.
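$$\underset{\alpha, \beta}{\text{maximize}}\ \ \mathcal{I}_S(\alpha, \beta) := \mathcal{I}\big([\phi_\alpha(s); \psi_\beta(a)];\ \phi_\alpha(s')\big) = D_{KL}\big(P^{\pi}_{SAS'} \,\|\, P^{\pi}_{SA} \otimes P^{\pi}_{S'}\big) \tag{2}$$
$$\underset{\alpha, \beta}{\text{maximize}}\ \ \mathcal{I}_A(\alpha, \beta) := \mathcal{I}\big([\phi_\alpha(s); \phi_\alpha(s')];\ \psi_\beta(a)\big) = D_{KL}\big(P^{\pi}_{SAS'} \,\|\, P^{\pi}_{SS'} \otimes P^{\pi}_{A}\big) \tag{3}$$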
Mutual information is not bounded from above, and it is notoriously difficult to estimate and maximize in high-dimensional settings. Motivated by Hjelm et al. (2018) and Belghazi et al. (2018), we compute the Donsker & Varadhan (1983) lower bound of mutual information. Concretely, the Donsker-Varadhan representation is a tight estimator for the mutual information of two random variables $X$ and $Y$, derived as in Equation 4.
$$\mathcal{I}(X; Y) = D_{KL}(P_{XY} \,\|\, P_X \otimes P_Y) \ge \sup_{\omega}\ \mathbb{E}_{P_{XY}}\left[T_\omega(x, y)\right] - \log \mathbb{E}_{P_X \otimes P_Y}\left[\exp\left(T_\omega(x, y)\right)\right] \tag{4}$$
where $T_\omega$ is a differentiable transform with parameter $\omega$.
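To make the estimator concrete, the sketch below evaluates a Donsker-Varadhan style bound, $\mathbb{E}_{\text{joint}}[T] - \log \mathbb{E}_{\text{marginal}}[\exp(T)]$, from two lists of critic outputs. It assumes the critic scores have already been computed by some network; the function and argument names are hypothetical and are not taken from the authors' code.

```python
import math


def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan estimate: mean(T on joint) - log mean(exp(T) on marginals)."""
    mean_joint = sum(joint_scores) / len(joint_scores)
    # Log-mean-exp computed stably by subtracting the maximum score first.
    m = max(marginal_scores)
    mean_exp = sum(math.exp(t - m) for t in marginal_scores) / len(marginal_scores)
    return mean_joint - (m + math.log(mean_exp))


# A critic that scores joint samples higher than marginal ones gives a positive bound.
print(dv_lower_bound([2.0, 2.0], [0.0, 0.0]))  # 2.0
```

A critic that cannot tell joint samples from marginal samples (identical scores on both) yields an estimate of zero, matching the intuition that no dependence has been detected.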
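Theorem 1.
$$\mathcal{I}^{(JSD)}(X; Y) \ge \sup_{\omega}\ \mathbb{E}_{P_{XY}}\left[-\mathrm{sp}\left(-T_\omega(x, y)\right)\right] - \mathbb{E}_{P_X \otimes P_Y}\left[\mathrm{sp}\left(T_\omega(x, y)\right)\right] + \log 4,$$
where $\mathrm{sp}(z) = \log(1 + \exp(z))$ is the softplus function.
Proof.
$$\begin{aligned}
\mathcal{I}^{(JSD)}(X; Y) &= D_{JSD}(P_{XY} \,\|\, P_X \otimes P_Y) \\
&\ge \sup_{\omega}\ \mathbb{E}_{P_{XY}}\left[S_\omega(x, y)\right] - \mathbb{E}_{P_X \otimes P_Y}\left[JSD^{*}\left(S_\omega(x, y)\right)\right] \\
&= \sup_{\omega}\ \mathbb{E}_{P_{XY}}\left[-\mathrm{sp}\left(-T_\omega(x, y)\right)\right] - \mathbb{E}_{P_X \otimes P_Y}\left[\mathrm{sp}\left(T_\omega(x, y)\right)\right] + \log 4
\end{aligned} \tag{5}$$
where the inequality in the second line holds from the definition of the $f$-divergence (Nowozin et al., 2016). In the third line, we substituted $S_\omega(x, y) = \log 2 - \log\left(1 + \exp(-T_\omega(x, y))\right)$ and the Fenchel conjugate of the Jensen-Shannon divergence, $JSD^{*}(t) = -\log\left(2 - \exp(t)\right)$. ∎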
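Furthermore, for better numerical stability, we approximate the KL divergence with the Jensen-Shannon divergence (JSD) (Hjelm et al., 2018), which is bounded both from below and above by $0$ and $\log 4$. From Theorem 1, we have
$$\underset{\alpha, \beta}{\text{maximize}}\ \ \mathcal{I}^{(JSD)}_S(\alpha, \beta) \ge \underset{\alpha, \beta}{\text{maximize}}\ \sup_{\omega_S}\ \mathbb{E}_{P^{\pi}_{SAS'}}\left[-\mathrm{sp}\left(-T_{\omega_S}(\phi_\alpha(s), \psi_\beta(a), \phi_\alpha(s'))\right)\right] - \mathbb{E}_{P^{\pi}_{SA} \otimes P^{\pi}_{S'}}\left[\mathrm{sp}\left(T_{\omega_S}(\phi_\alpha(s), \psi_\beta(a), \phi_\alpha(\tilde{s}'))\right)\right] + \log 4 \tag{6}$$
$$\underset{\alpha, \beta}{\text{maximize}}\ \ \mathcal{I}^{(JSD)}_A(\alpha, \beta) \ge \underset{\alpha, \beta}{\text{maximize}}\ \sup_{\omega_A}\ \mathbb{E}_{P^{\pi}_{SAS'}}\left[-\mathrm{sp}\left(-T_{\omega_A}(\phi_\alpha(s), \psi_\beta(a), \phi_\alpha(s'))\right)\right] - \mathbb{E}_{P^{\pi}_{SS'} \otimes P^{\pi}_{A}}\left[\mathrm{sp}\left(T_{\omega_A}(\phi_\alpha(s), \psi_\beta(\tilde{a}), \phi_\alpha(s'))\right)\right] + \log 4 \tag{7}$$
where $\mathrm{sp}(z) = \log(1 + \exp(z))$. The expectations in Equation 6 and Equation 7 are approximated using the empirical samples of trajectories $\tau$. Note that the samples $\tilde{s}' \sim P^{\pi}_{S'}$ and $\tilde{a} \sim P^{\pi}_{A}$ from the marginals are obtained by dropping $(s, a)$ and $(s, s')$ in samples $(s, a, \tilde{s}')$ and $(s, \tilde{a}, s')$ from $P^{\pi}_{SAS'}$. Figure 2 illustrates the computational architecture for estimating the lower bounds on $\mathcal{I}_S$ and $\mathcal{I}_A$.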
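The sample-based estimate described above can be sketched as follows, assuming the softplus form $\mathbb{E}_{\text{joint}}[-\mathrm{sp}(-T)] - \mathbb{E}_{\text{marginal}}[\mathrm{sp}(T)] + \log 4$ of the Jensen-Shannon bound from Nowozin et al. (2016). Marginal samples are drawn here by shuffling the second element of each pair across the batch, which is one possible way to "drop" the paired variable and is an assumption of this sketch. The `critic` callable and all names are hypothetical stand-ins for the information network.

```python
import math
import random


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def jsd_lower_bound(critic, joint_pairs, seed=0):
    """Estimate E_joint[-sp(-T)] - E_marginal[sp(T)] + log 4 from (x, y) pairs."""
    rng = random.Random(seed)
    # Break the pairing to approximate samples from the product of marginals.
    shuffled_ys = [y for _, y in joint_pairs]
    rng.shuffle(shuffled_ys)
    n = len(joint_pairs)
    positive = sum(-softplus(-critic(x, y)) for x, y in joint_pairs) / n
    negative = sum(
        softplus(critic(x, y)) for (x, _), y in zip(joint_pairs, shuffled_ys)
    ) / n
    return positive - negative + math.log(4.0)


# Toy pairs where y always equals x, scored by a critic that rewards matches.
pairs = [(i, i) for i in range(8)]
estimate = jsd_lower_bound(lambda x, y: 5.0 if x == y else -5.0, pairs)
```

For the first criterion, `x` would hold the pair of state and action embeddings and `y` the next-state embedding; for the second, `x` would hold the state and next-state embeddings and `y` the action embedding. The estimate can never exceed $\log 4$, since the first term is non-positive and the second is non-negative.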
4.2 Embedding linear dynamics model under sparse noise
Since the embedding representation space is learned, it is natural to impose a topology on it (Kohonen, 1983). In EMI, we impose a simple and convenient topology where transitions are linear since this spares us from having to also represent a complex dynamical model. This allows us to offload most of the modeling burden onto the embedding function itself, which in turn provides us with a useful and informative measure of surprise when visiting novel states. Once the embedding representations are learned, this linear dynamics model allows us to measure surprise in terms of the residual error under the model or measure diversity in terms of the similarity in the embedding space. Section 5 discusses the intrinsic reward computation procedure in more detail.
Concretely, we seek to learn the representation of states $\phi(s)$ and the actions $\psi(a)$ such that the representation of the corresponding next state $\phi(s')$ follows linear dynamics, i.e. $\phi(s') = \phi(s) + \psi(a)$. Intuitively, we would like the nonlinear aspects of the dynamics to be offloaded to the neural networks so that in the embedding space, the dynamics become linear. Regardless of the expressivity of the neural networks, however, there always exists irreducible error under the linear dynamics model. For example, the state transition which leads the agent from one room to another in Atari environments (e.g. Venture, Montezuma’s Revenge, etc.) or the transition leaving the agent in the same position under certain actions (e.g. the agent bumping into a wall when navigating a maze environment) would be extremely challenging to explain under the linear dynamics model.
To this end, we introduce the error model $S_\gamma: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$, which is another neural network taking the state and action as input, estimating the irreducible error under the linear model. Motivated by the work of Candès et al. (2011), we seek to promote sparsity of the error term so that it contributes only on rare unexplainable occasions. Equation 8 shows the embedding learning problem under linear dynamics with sparse errors.
$$\underset{\alpha, \beta, \gamma}{\text{minimize}}\ \ \|S_\gamma\|_{2,0} \quad \text{subject to} \quad \Phi'_\alpha = \Phi_\alpha + \Psi_\beta + S_\gamma \tag{8}$$
where we used the matrix notation for compactness. $\Phi_\alpha$, $\Psi_\beta$, and $S_\gamma$ denote the matrices of the respective embedding representations stacked column-wise. Relaxing the $\|\cdot\|_{2,0}$ norm with the $\ell_{2,1}$ norm, Equation 9 shows our final learning objective.
(9) 
The hyperparameters in Equation 9 control the relative contributions of the linear dynamics error and the sparsity. In practice, we found the optimization process to be more stable when we further regularize the distribution of the action embedding representation to follow a predefined prior distribution. Concretely, we regularize the action embedding distribution to follow a standard normal distribution via a KL divergence penalty, in similar spirit to VAEs (Kingma & Welling, 2013). Intuitively, this has the effect of grounding the distribution of the action embedding representation (and consequently the state embedding representation) across different iterations of the learning process.
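The dynamics part of the objective can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a squared residual for the linear dynamics error and a column-wise $\ell_{2,1}$ penalty (sum of per-sample Euclidean norms) on the error term, and it omits the mutual information terms and the action embedding prior. The function name and the `sparsity_weight` parameter are hypothetical.

```python
import math


def linear_dynamics_loss(phi, psi, err, phi_next, sparsity_weight=0.1):
    """Squared residual of phi' = phi + psi + err, plus an l_{2,1} penalty on err.

    Each argument is a list with one d-dimensional embedding per transition.
    """
    residual = 0.0
    sparsity = 0.0
    for p, a, e, pn in zip(phi, psi, err, phi_next):
        # Squared error of the linear prediction for this transition.
        residual += sum(
            (pn_k - (p_k + a_k + e_k)) ** 2
            for p_k, a_k, e_k, pn_k in zip(p, a, e, pn)
        )
        # Euclidean norm of this transition's error vector (one column of S).
        sparsity += math.sqrt(sum(e_k ** 2 for e_k in e))
    return residual + sparsity_weight * sparsity


# Two transitions: the first is exactly linear, the second needs an error term.
loss = linear_dynamics_loss(
    phi=[[0.0, 0.0], [1.0, 1.0]],
    psi=[[1.0, 0.0], [0.0, 1.0]],
    err=[[0.0, 0.0], [3.0, 4.0]],
    phi_next=[[1.0, 0.0], [4.0, 6.0]],
)
print(loss)  # 0.5
```

Summing unsquared per-sample norms is what encourages whole error vectors to vanish for most transitions, so that the error model is active only for the rare transitions the linear model cannot explain.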
5 Intrinsic reward augmentation
We consider two different formulations of computing the intrinsic reward. First, we consider a relative difference in the novelty of state representations based on the distance in the embedding representation space, similar to Oh et al. (2015), as shown in Equation 10. The relative difference makes sure the intrinsic reward diminishes to zero (Ng et al., 1999) once the agent has sufficiently explored the state space. Second, we consider a formulation based on the prediction error under the linear dynamics model, as shown in Equation 11. This formulation incorporates the sparse error term and makes sure we set apart the irreducible error that does not contribute to novelty.
(10) 
(11) 
Note the relative diversity term should be computed after the representations are updated based on the samples from the latest trajectories while the prediction error term should be computed before the update. Algorithm 1 shows the complete learning procedure in detail.
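The two reward formulations can be sketched as follows. The exact functional forms are assumptions of this sketch: the prediction error is taken as the squared residual of the linear model with the error term included, and the relative diversity is taken as the drop in a mean Gaussian-kernel similarity to previously visited embeddings when moving from the current state to the next state. All function names are hypothetical.

```python
import math


def prediction_error_reward(phi_s, psi_a, err, phi_next):
    """Squared residual of phi(s') against phi(s) + psi(a) + error term."""
    return sum(
        (pn - (p + a + e)) ** 2
        for p, a, e, pn in zip(phi_s, psi_a, err, phi_next)
    )


def similarity(phi_s, visited):
    """Mean Gaussian-kernel similarity of one embedding to visited embeddings."""
    d = len(phi_s)
    return sum(
        math.exp(-sum((x - y) ** 2 for x, y in zip(phi_s, v)) / d)
        for v in visited
    ) / len(visited)


def relative_diversity_reward(phi_s, phi_next, visited):
    """Positive when the next state is less similar to visited states."""
    return similarity(phi_s, visited) - similarity(phi_next, visited)


visited = [[0.0, 0.0], [0.1, 0.0]]
reward = relative_diversity_reward([0.0, 0.0], [10.0, 10.0], visited)
```

Because the diversity reward is a difference of the same function evaluated at consecutive states, it shrinks toward zero once all reachable embeddings are about equally familiar, in line with the diminishing behavior described above.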
6 Experiments
We compare the experimental performance of EMI to recent prior works on both low-dimensional locomotion tasks with continuous control from the rllab benchmark (Duan et al., 2016) and complex vision-based tasks with discrete control from the Arcade Learning Environment (Bellemare et al., 2013). For the locomotion tasks, we chose the SwimmerGather and SparseHalfCheetah environments for direct comparison against the prior work of Fu et al. (2017). SwimmerGather is a hierarchical task where a two-link robot needs to reach green pellets, which give positive reward, while avoiding red pellets, which give negative reward. SparseHalfCheetah is a challenging locomotion task where a cheetah-like robot does not receive any reward until it moves 5 units in one direction.
For vision-based tasks, we selected Freeway, Frostbite, Venture, Montezuma’s Revenge, Gravitar, and Solaris for comparison with recent prior works (Pathak et al., 2017; Fu et al., 2017). These six Atari environments feature very sparse reward feedback and often contain many moving distractor objects, which can be challenging for methods that rely on explicit decoding of the full observations (Oh et al., 2015).
6.1 Implementation Details
We use TRPO (Schulman et al., 2015) for policy optimization because of its capability to support both discrete and continuous actions and its robustness with respect to the hyperparameters. In the locomotion experiments, we use a 2-layer fully connected neural network as the policy network. In the Atari experiments, we use a 2-layer convolutional neural network followed by a single-layer fully connected neural network. We convert the 84 x 84 input RGB frames to grayscale images and resize them to 52 x 52 images following the practice in Tang et al. (2017). The embedding dimensionality is set to $d = 2$ in all of the environments except for Gravitar and Solaris, where we use a different setting due to their complex environment dynamics. We use the Adam (Kingma & Ba, 2015) optimizer to train the embedding networks. Please refer to Section A.1 for more details.
6.2 Locomotion tasks with continuous control
We compare EMI with TRPO (Schulman et al., 2015) and EX2 (Fu et al., 2017) on two challenging locomotion environments: SwimmerGather and SparseHalfCheetah. Note that as ICM (Pathak et al., 2017) does not support continuous control, we omit the comparison with ICM for this experiment. Figure 4 shows that EMI outperforms the baseline methods by a large margin on both tasks. Figure 2(b) visualizes the scatter plot of the learned state embeddings and an example trajectory for the SparseHalfCheetah experiment. The figure shows that the learned representation successfully preserves the similarity in observation space.
6.3 Vision-based tasks with discrete control
For vision-based exploration tasks, we compare EMI with TRPO (Schulman et al., 2015), EX2 (Fu et al., 2017), and ICM (Pathak et al., 2017). Our results in Figure 5 show that EMI achieves state-of-the-art performance on Freeway, Frostbite, Venture, and Montezuma’s Revenge in comparison to the baseline exploration methods. Figures 2(f), 2(e), 2(d) and 2(c) illustrate our learned state embeddings $\phi$. Since our embedding dimensionality is set to $d = 2$, we directly visualize the scatter plot of the embedding representation in 2D. Figure 2(d) shows that the embedding space naturally separates state samples into two clusters, each of which corresponds to a different room in Montezuma’s Revenge. Figure 2(f) shows smooth sample transitions along the embedding space in Frostbite, where functionally similar states are close together and distinct states are far apart. For information about how our error term works in those vision-based tasks, please refer to Section A.2.
7 Conclusion
We presented EMI, a practical exploration method that does not rely on direct generation of high-dimensional observations while extracting the predictive signal that can be used for exploration within a compact representation space. Our results on challenging robotic locomotion tasks with continuous actions and high-dimensional image-based games with sparse rewards show that our approach transfers to a wide range of tasks and achieves state-of-the-art results, significantly outperforming recent prior works on exploration. As future work, we would like to explore utilizing the learned linear dynamics model for optimal planning in the embedding representation space. In particular, we would like to investigate how an optimal trajectory from a state to a given goal in the embedding space under the linear representation topology translates to the optimal trajectory in the observation space under complex dynamical systems.
Acknowledgements
This work is supported by Samsung Advanced Institute of Technology.
References
 Belghazi et al. (2018) Ishmael Belghazi, Sai Rajeswar, Aristide Baratin, R Devon Hjelm, and Aaron Courville. Mutual information neural estimation. In International Conference on Machine Learning, volume 2018, 2018.
 Bellemare et al. (2014) Marc Bellemare, Joel Veness, and Erik Talvitie. Skip context tree switching. In International Conference on Machine Learning, pp. 1458–1466, 2014.
 Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
 Bellemare et al. (2013) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Candès et al. (2011) Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
 Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.
 Donsker & Varadhan (1983) Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain markov process expectations for large time. iv. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983.
 Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
 Fu et al. (2017) Justin Fu, John Co-Reyes, and Sergey Levine. EX2: Exploration with exemplar models for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2577–2587, 2017.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
 Houthooft et al. (2016) Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117, 2016.
 Kingma & Ba (2015) Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kohonen (1983) Teuvo Kohonen. Representation of information in spatial maps which are produced by self-organization. In Synergetics of the Brain, pp. 264–273. Springer, 1983.
 Kohonen & Somervuo (1998) Teuvo Kohonen and Panu Somervuo. Self-organizing maps of symbol strings. Neurocomputing, 21(1-3):19–30, 1998.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Ng et al. (1999) Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
 Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
 Oh et al. (2015) Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871, 2015.
 Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, volume 2017, 2017.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, volume 2015, 2015.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Stadie et al. (2015) Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
 Tang et al. (2017) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2753–2762, 2017.
Appendix A
A.1 Experiment Hyperparameters
In all experiments, we use the Adam optimizer with a learning rate of 0.001 and a minibatch size of 512 for 3 epochs to optimize the embedding networks. In each iteration, we use the collected TRPO batch to train the embedding networks, except for SparseHalfCheetah, which uses a FIFO replay buffer of size 250000. The embedding dimensionality is set to $d = 2$ in all of the environments except for Gravitar and Solaris, where we use a different setting. The relative diversity term is used as the intrinsic reward with a weight of 0.1, except for Venture and Montezuma’s Revenge, where the intrinsic reward is set as the prediction error term with a weight of 0.001. For the convenience of experiments, we fix one of the weights to 1 and tune the coefficient that multiplies the information terms in Equation 9. The following tables give the detailed information of the remaining hyperparameters.
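Environments | SwimmerGather | SparseHalfCheetah
TRPO step size | 0.01 | 0.01
TRPO batch size | 50k | 5k
Policy network | A 2-layer FC with (64, 32) hidden units (tanh) | A 2-layer FC with (64, 32) hidden units (tanh)
Baseline network | A 32 hidden units FC (ReLU) | Linear baseline
$\phi$ network | Same structure as policy network | Same structure as policy network
$\psi$ network | A 64 hidden units FC (ReLU) | A 64 hidden units FC (ReLU)
Information network | A 2-layer FC with (64, 64) hidden units (ReLU) | A 2-layer FC with (64, 64) hidden units (ReLU)
Error network |  | 
Max path length | 500 | 500
Discount factor | 0.995 | 0.995
 | 0.05 | 0.05
 | 0.1 | 0.1
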
Environments | Freeway, Frostbite, Venture, Montezuma’s Revenge, Gravitar, Solaris
TRPO step size | 0.01
TRPO batch size | 100k
Policy network | 2 convolutional layers (16 8x8 filters of stride 4, 32 4x4 filters of stride 2), followed by a 256 hidden units FC (ReLU)
Baseline network | Same structure as policy network
$\phi$ network | Same structure as policy network
$\psi$ network | A 64 hidden units FC (ReLU)
Information network | A 2-layer FC with (64, 64) hidden units (ReLU)
Error network | State input passes the same network structure as policy network. Concat layer concatenates state output and action. A 256 units FC (ReLU)
Max path length | 4500
Discount factor | 0.995
 | 0.1
 | 0.5
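As a quick sanity check on the convolutional policy network listed above, the following sketch works out the feature map sizes for a 52 x 52 grayscale input (the resolution stated in Section 6.1). It assumes convolutions without padding, which the table does not specify; the helper name is hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output width of a square convolution with no padding."""
    return (size - kernel) // stride + 1


# 16 filters of 8x8 with stride 4, then 32 filters of 4x4 with stride 2.
side = conv_out(conv_out(52, kernel=8, stride=4), kernel=4, stride=2)
flat_features = 32 * side * side
print(side, flat_features)  # 5 800
```

Under this assumption, the 256-unit fully connected layer would receive an 800-dimensional flattened input.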
A.2 Experimental evaluation of the error model
In order to understand how the error term in EMI works in practice, we visualize three representative transition samples in Figure 6.
In the case of Figure 5(a), due to the discrepancy between the two different background images, usually becomes large which makes the error term larger, too. The norm of the error term for this specific sample was and the resulting residual error was . Figure 5(b) describes the case where the action chosen by the policy has no effect on i.e. . Linear models without any noise terms can easily fail in such events. Thus, the error term in our model gets bigger to mitigate the modeling error. for this example transition was and its corresponding residual error was .
On the other hand, Figure 5(c) represents cases that the chosen action works in the environment as intended. Although this statement alone does not guarantee the effectiveness of the linear model, the error terms are likely to be small for most of the samples. The norm of the error term for the actual sample was and its residual error was .