Deep Variational Reinforcement Learning

Many real-world sequential decision-making problems are partially observable by nature, and the environment model is typically unknown. Consequently, there is great need for reinforcement learning methods that can tackle such problems given only a stream of incomplete and noisy observations. In this paper, we propose deep variational reinforcement learning (DVRL), which introduces an inductive bias that allows an agent to learn a generative model of the environment and perform inference in that model to effectively aggregate the available information. We develop an $n$-step approximation to the evidence lower bound (ELBO), allowing the model to be trained jointly with the policy. This ensures that the latent state representation is suitable for the control task. In experiments on Mountain Hike and flickering Atari, we show that our method outperforms previous approaches relying on recurrent neural networks to encode the past.
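To make the idea of training the model jointly with the policy concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes a particle-based estimate in which each of the n steps contributes the log of the mean unnormalized importance weight, which is the standard sequential Monte Carlo estimator of the log marginal likelihood. The function names `n_step_elbo_estimate` and `joint_loss` and the parameter `elbo_weight` are hypothetical and chosen only for illustration.

```python
import numpy as np


def n_step_elbo_estimate(log_weights):
    """Sum over steps of log-mean-exp of per-particle log weights.

    log_weights: array of shape (n_steps, num_particles) holding
    unnormalized log importance weights (an assumed input format).
    """
    log_weights = np.asarray(log_weights, dtype=float)
    # Subtract the per-step maximum for numerical stability.
    max_lw = log_weights.max(axis=1, keepdims=True)
    log_mean_w = max_lw[:, 0] + np.log(np.exp(log_weights - max_lw).mean(axis=1))
    return float(log_mean_w.sum())


def joint_loss(policy_loss, elbo_estimate, elbo_weight=1.0):
    """Combine a policy loss with the negated ELBO estimate.

    Minimizing this quantity lowers the policy loss while raising the
    ELBO estimate; elbo_weight is a hypothetical trade-off coefficient.
    """
    return policy_loss - elbo_weight * elbo_estimate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 5 steps, 10 particles of synthetic log weights (illustrative only).
    demo_log_weights = rng.normal(size=(5, 10))
    elbo = n_step_elbo_estimate(demo_log_weights)
    print("ELBO estimate:", elbo)
    print("Joint loss:", joint_loss(policy_loss=0.3, elbo_estimate=elbo))
```

Because both terms enter a single scalar loss, gradients from the control objective and from the model objective reach the same latent representation, which is the property the abstract attributes to joint training.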





We would like to thank Wendelin Boehmer and Greg Farquar for useful discussions and feedback. The NVIDIA DGX-1 used for this research was donated by the NVIDIA Corporation. M. Igl is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. L. Zintgraf is supported by the Microsoft Research PhD Scholarship Program. T. A. Le is supported by EPSRC DTA and Google (project code DF6700) studentships. F. Wood is supported by DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA8750-14-2-0006; Intel and DARPA D3M, under Cooperative Agreement FA8750-17-2-0093. S. Whiteson is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713).


