Deep Variational Reinforcement Learning

Abstract

Many real-world sequential decision-making problems are partially observable by nature, and the environment model is typically unknown. Consequently, there is a great need for reinforcement learning methods that can tackle such problems given only a stream of incomplete and noisy observations. In this paper, we propose deep variational reinforcement learning (DVRL), which introduces an inductive bias that allows an agent to learn a generative model of the environment and perform inference in that model to effectively aggregate the available information. We develop an n-step approximation to the evidence lower bound (ELBO), allowing the model to be trained jointly with the policy. This ensures that the latent state representation is suitable for the control task. In experiments on Mountain Hike and flickering Atari, we show that our method outperforms previous approaches relying on recurrent neural networks to encode the past.

Acknowledgements

We would like to thank Wendelin Boehmer and Greg Farquhar for useful discussions and feedback. The NVIDIA DGX-1 used for this research was donated by the NVIDIA Corporation. M. Igl is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. L. Zintgraf is supported by the Microsoft Research PhD Scholarship Program. T. A. Le is supported by EPSRC DTA and Google (project code DF6700) studentships. F. Wood is supported by DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA8750-14-2-0006; Intel and DARPA D3M, under Cooperative Agreement FA8750-17-2-0093. S. Whiteson is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713).
