Randomized Value Functions via Multiplicative Normalizing Flows
Randomized value functions offer a promising approach towards the challenge of efficient exploration in complex environments with high dimensional state and action spaces. Unlike traditional point estimate methods, randomized value functions maintain a posterior distribution over action-space values. This prevents the agent’s behavior policy from prematurely exploiting early estimates and falling into local optima. In this work, we leverage recent advances in variational Bayesian neural networks and combine these with traditional Deep Q-Networks (DQN) to achieve randomized value functions for high-dimensional domains. In particular, we augment DQN with multiplicative normalizing flows in order to track an approximate posterior distribution over its parameters. This allows the agent to perform approximate Thompson sampling in a computationally efficient manner via stochastic gradient methods. We demonstrate the benefits of our approach through an empirical comparison in high dimensional environments.
Ahmed Touati (1,3), Harsh Satija (1,3), Joshua Romoff (2,3), Joelle Pineau (2,3), Pascal Vincent (1,3)
(1) MILA, Université de Montréal; (2) MILA, McGill University; (3) Facebook AI Research
Preprint. Work in progress.
1 Introduction
Efficient exploration is one of the main obstacles in scaling up modern deep reinforcement learning (RL) algorithms (Bellemare et al., 2016; Osband et al., 2017; Fortunato et al., 2017). The main challenge in efficient exploration is the balance between exploiting current estimates and gaining information about poorly understood states and actions. Despite the wealth of research into provably efficient exploration strategies, most methods focus on tabular representations and are typically intractable in high dimensional environments (Strehl and Littman, 2005; Kearns and Singh, 2002; Brafman and Tennenholtz, 2002). Currently, the most widely used technique in Deep RL involves perturbing the greedy action with some local random noise, e.g., $\epsilon$-greedy or Boltzmann exploration (Sutton and Barto, 1998). This naive perturbation is not directed; it keeps exploring actions that are already known to be sub-optimal and may result in sample complexity that grows exponentially with the number of states (Kearns and Singh, 2002; Osband et al., 2017).
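As a point of reference, the two undirected strategies mentioned above can be sketched in a few lines. This is a minimal illustration, not code from this work; the function names and the `temperature` parameter are assumptions made for the example.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; the max is subtracted for numerical stability."""
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Both rules perturb the greedy action independently at every step and ignore how uncertain each value estimate is, which is the sense in which they are not directed.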
Optimism in the face of uncertainty is one of the traditional guiding principles that offers provably efficient learning algorithms (Strehl and Littman, 2005; Kearns and Singh, 2002; Brafman and Tennenholtz, 2002; Jaksch et al., 2010). These algorithms incentivize learning about the environment by rewarding the discoveries of poorly understood states and actions with an exploration bonus. In these approaches, the agent first builds a confidence set over Markov Decision Processes (MDPs) that contains the true MDP with high probability. Then, the agent determines the most optimistic and statistically plausible version of its model and acts optimally with respect to it. Inspired by this principle, several Deep RL works prescribe guided exploration strategies, such as pseudo-counts (Bellemare et al., 2016), variational information maximization (Houthooft et al., 2016) and model prediction errors (Stadie et al., 2015). All of the aforementioned methods add an intrinsic reward to the original reward and then simply train traditional Deep RL algorithms on the augmented MDP.
An entire body of algorithms for efficient exploration is inspired by Thompson sampling (Thompson, 1933). Bayesian dynamic programming was first introduced in Strens (2000) and is more recently known as posterior sampling for reinforcement learning (PSRL) (Osband et al., 2013). In PSRL, the agent starts with a prior belief over world models and then proceeds to update its full posterior distribution over models with the newly observed samples. A model hypothesis is then sampled from this distribution, and a greedy policy with respect to the sampled model is followed thereafter. Unfortunately, due to their high computational cost, these methods are only feasible on small MDPs and are of limited practical use in high dimensional environments.
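The sample-then-act-greedily loop is easiest to see in the simplest setting, a Bernoulli bandit with a Beta posterior per arm. The sketch below illustrates Thompson sampling in that setting only; it is not PSRL over full MDP models, and the function name, the uniform Beta(1, 1) prior and the arm means are assumptions chosen for the example.

```python
import random


def thompson_bernoulli(true_means, n_rounds, seed=0):
    """Thompson sampling on a Bernoulli bandit; returns pull counts per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    alpha = [1.0] * n_arms  # Beta(1, 1) prior (assumed): one pseudo-success
    beta = [1.0] * n_arms   # and one pseudo-failure per arm
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Sample one hypothesis per arm from the current posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # Act greedily with respect to the sampled hypothesis.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate posterior update with the newly observed sample.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Arms whose posterior is still wide are occasionally sampled as the best and therefore tried, while arms that are confidently poor are abandoned.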
Osband et al. (2017) developed randomized value functions in order to improve the scalability of PSRL. At an abstract level, randomized value functions can be interpreted as a model free version of PSRL. Instead of maintaining a posterior belief over possible models, the agent’s belief is expressed over value functions. Similarly to PSRL, a value function is sampled at the start of each episode and actions are selected greedily thereafter. Subsequently, actions with highly uncertain values are explored due to the variance in the sampled value functions. In order to scale this approach to large MDPs with linear function approximation, Osband et al. (2016b) introduce randomized least-square value iteration (RLSVI) which involves using Bayesian linear regression for learning the value function.
In the present work, we are interested in using randomized value functions with deep neural networks as a function approximator. To address the issues with computational and statistical efficiency, we leverage recent advances in variational Bayesian neural networks. Specifically, we use multiplicative normalizing flows (MNF) (Louizos and Welling, 2017) in order to account for the uncertainty of estimates for efficient exploration. MNF is a recently introduced family of approximate posteriors for Bayesian neural networks that allows for arbitrary dependencies between neural network parameters.
Our main contribution is the introduction of MNFs into standard value based Deep RL algorithms. We highlight the approach experimentally by comparing our method against recent Deep RL baselines on several challenging exploration domains, including the Arcade Learning Environment (ALE) (Bellemare et al., 2013). Based on our experiments, we conclude that the richness of the approximate posterior in MNF allows for more efficient exploration in deep reinforcement learning.
2 Reinforcement Learning Background
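In reinforcement learning, an agent interacts with its environment, which is modelled as a discounted Markov Decision Process with state space $\mathcal{S}$, action space $\mathcal{A}$, discount factor $\gamma \in (0, 1)$, transition probabilities $P$ mapping state-action pairs to distributions over next states, and reward function $r$ (Sutton and Barto, 1998). We denote by $\pi(a \mid s)$ the probability of choosing an action $a$ in the state $s$ under the policy $\pi$. The action-value function for policy $\pi$, denoted $Q^{\pi}$, represents the expected sum of discounted rewards along the trajectories induced by the MDP and $\pi$: $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. The expectation is over the distribution of admissible trajectories obtained by executing the policy $\pi$ starting from $s_0 = s$ and $a_0 = a$. The action-value function of the optimal policy is $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ and it satisfies the Bellman optimality equation:
$$Q^{*}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[r(s, a) + \gamma \max_{a'} Q^{*}(s', a')\right]$$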
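Fitted Q iteration (FQI) (Gordon, 1999; Riedmiller, 2005) assumes that the entire learning dataset of agent interactions is available from the start. If $\mathcal{D}$ represents the dataset consisting of transitions $(s, a, r, s')$, and $W$ represents the weights of the function approximator, then the problem can be formulated as a supervised learning regression problem by minimizing the following:
$$L(W) = \frac{1}{|\mathcal{D}|} \sum_{(s, a, r, s') \in \mathcal{D}} \left(Q(s, a; W) - y\right)^2, \qquad y = r + \gamma \max_{a'} Q(s', a'; W),$$
where the targets $y$ are computed from the current estimate and treated as constants during the regression.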
Deep Q-Networks (DQN) (Mnih et al., 2015) incorporate a deep neural network, parameterized by $W$, as a function approximator for the action-value function of the optimal policy. The neural network parameters are estimated by minimizing the squared temporal difference residual:
$$L(W) = \mathbb{E}_{(s, a, r, s')}\left[\left(Q(s, a; W) - \left(r + \gamma \max_{a'} Q(s', a'; W^{-})\right)\right)^2\right] \tag{3}$$
where transitions $(s, a, r, s')$ are sampled from a replay buffer of recent observed transitions. Here $W^{-}$ denotes the parameters of a target network which is updated ($W^{-} \leftarrow W$) regularly and held fixed between individual updates of $W$. The action-value function defines the policy implicitly by $\pi(s) = \arg\max_{a} Q(s, a; W)$.
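To make the temporal difference residual concrete, the following sketch computes the bootstrapped target and the mean squared residual over a batch. It is an illustration under simplifying assumptions: the Q-functions are passed in as plain callables (hypothetical `q_online` and `q_target`), and the `done` flag for terminal transitions is a common implementation detail that the text above does not discuss.

```python
def td_target(reward, next_q_values, gamma, done):
    """Target r + gamma * max_a' Q(s', a') computed with the target network."""
    if done:
        # Assumed convention: no bootstrapping past a terminal transition.
        return reward
    return reward + gamma * max(next_q_values)


def mean_squared_td_error(batch, q_online, q_target, gamma):
    """Average squared TD residual over a batch of transitions.

    batch: list of (state, action, reward, next_state, done) tuples.
    q_online, q_target: callables mapping a state to a list of Q-values.
    """
    total = 0.0
    for state, action, reward, next_state, done in batch:
        y = td_target(reward, q_target(next_state), gamma, done)
        total += (q_online(state)[action] - y) ** 2
    return total / len(batch)
```

With a tabular stand-in such as `table = {0: [0.0, 1.0], 1: [2.0, 0.5]}`, calling `mean_squared_td_error([(0, 1, 1.0, 1, False)], table.get, table.get, 0.9)` evaluates one residual: the target is 1.0 + 0.9 * 2.0 = 2.8 and the squared residual is (1.0 - 2.8)^2 = 3.24.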
3 Variational inference for Bayesian neural networks
In order to explore more efficiently, our approach captures the uncertainty of the value estimates using Bayesian inference. Instead of maintaining a point estimate of the deep Q-network parameters, we infer a posterior distribution. However, due to the nonlinear aspect of neural networks, obtaining the posterior distributions is not tractable and approximations have to be introduced. Thus, in this work, we use the variational inference procedure (Hinton and van Camp, 1993) and the so-called reparametrization trick for neural networks (Kingma and Welling, 2013; Rezende et al., 2014).
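Variational Inference. Let $\mathcal{D}$ be a dataset consisting of input-output pairs $\{(x_i, y_i)\}_{i=1}^{n}$. A neural network parameterized by weights $W$ models the conditional probability $p(y \mid x, W)$ of an output $y$ given an input $x$. Let $p(W)$ and $q_{\phi}(W)$ be respectively the prior and approximate posterior over weights $W$. Variational Inference (VI) consists of maximizing the following Evidence Lower Bound (ELBO) with respect to the variational posterior parameters $\phi$:
$$\mathcal{L}(\phi) = \mathbb{E}_{q_{\phi}(W)}\left[\sum_{i=1}^{n} \log p(y_i \mid x_i, W)\right] - \mathrm{KL}\left(q_{\phi}(W) \,\|\, p(W)\right) \tag{5}$$
We note that the ELBO is a lower bound on the marginal log-likelihood $\log p(\mathcal{D})$ of the dataset $\mathcal{D}$.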
Mean Field Approximation. Blundell et al. (2015) assume a mean field with independent Gaussian distributions for each weight. Let $W \in \mathbb{R}^{D_{in} \times D_{out}}$ be the weight matrix of a fully connected layer; then $q_{\phi}(W) = \prod_{i=1}^{D_{in}} \prod_{j=1}^{D_{out}} \mathcal{N}(w_{ij}; \mu_{ij}, \sigma_{ij}^2)$, where $\mu_{ij}$ and $\sigma_{ij}$ are learned parameters. The uni-modality and the full factorization of this Gaussian are both limiting assumptions for high dimensional weights. They are not flexible enough to capture the true posterior distribution, which is much more complex. Thus, the accuracy of the model's uncertainty estimates is potentially compromised.
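A small sketch may help fix ideas about the mean field family: weights are drawn with the reparametrization trick, and the KL term of the ELBO has a closed form when the prior is Gaussian. The standard normal prior used here is an assumption for the example (the text does not specify the prior), and the function names are hypothetical.

```python
import math
import random


def sample_mean_field(mu, sigma, rng):
    """Reparametrized draw: w_ij = mu_ij + sigma_ij * eps_ij with eps_ij ~ N(0, 1)."""
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_row, sigma_row)]
        for mu_row, sigma_row in zip(mu, sigma)
    ]


def kl_to_standard_normal(mu, sigma):
    """KL( prod_ij N(mu_ij, sigma_ij^2) || prod_ij N(0, 1) ), summed over all weights."""
    total = 0.0
    for mu_row, sigma_row in zip(mu, sigma):
        for m, s in zip(mu_row, sigma_row):
            total += 0.5 * (s * s + m * m - 1.0) - math.log(s)
    return total
```

Because every weight has its own independent Gaussian, the KL term is a plain sum over weights; this tractability is what the richer posterior introduced next gives up.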
Multiplicative Normalizing Flows (MNF). Louizos and Welling (2017) use multiplicative noise to define a more expressive approximate posterior. Multiplicative noise is often used as stochastic regularization in training a deterministic neural network, such as Gaussian Dropout (Srivastava et al., 2014). The technique was later reinterpreted as an algorithm for approximate inference in Bayesian neural networks (Kingma et al., 2015). The approximate posterior is as follows:
$$z \sim q_{\phi}(z), \qquad W \mid z \sim q_{\phi}(W \mid z) = \prod_{i=1}^{D_{in}} \prod_{j=1}^{D_{out}} \mathcal{N}\left(w_{ij}; z_i \mu_{ij}, \sigma_{ij}^2\right) \tag{6}$$
The approximate posterior is considered an infinite mixture $q_{\phi}(W) = \int q_{\phi}(W \mid z)\, q_{\phi}(z)\, dz$, where $z$ plays the role of an auxiliary latent variable (Salimans et al., 2015; Ranganath et al., 2016). The vector $z$ is of much lower dimension ($D_{in}$) than $W$ ($D_{in} \times D_{out}$). To make the posterior approximation richer and allow arbitrarily complex dependencies between the components of the weight matrix, the mixing density $q_{\phi}(z)$ is modeled via normalizing flows (Rezende and Mohamed, 2015). This comes at an additional computational cost that scales linearly in the number of parameters.
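Normalizing flows are a class of invertible deterministic transformations for which the determinant of the Jacobian can be computed efficiently. A rich density function can be obtained by successively applying $K$ invertible transformations $f_1, \dots, f_K$ to an initial random variable $z_0$. Consider a simple initial distribution, a factorized Gaussian $q(z_0) = \prod_{i=1}^{D_{in}} \mathcal{N}(z_{0,i}; \mu_{z,i}, \sigma_{z,i}^2)$; the computation is then as follows:
$$z_K = f_K \circ \dots \circ f_1(z_0), \qquad \log q(z_K) = \log q(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$$
In multiplicative normalizing flows (MNF) (Louizos and Welling, 2017), $z = z_K$ acts multiplicatively on the mean of the weights as shown in Equation 6. We denote by $\phi$ the learnable posterior parameters, which are composed of the matrices $\mu = [\mu_{ij}]$ and $\sigma = [\sigma_{ij}]$ and the normalizing flow (NF) parameters (including $\mu_z$ and $\sigma_z$).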
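The change-of-variables bookkeeping can be illustrated with planar flows (Rezende and Mohamed, 2015), chosen here only because their Jacobian determinant has a one-line closed form; this is not necessarily the flow architecture used in MNF. The sketch assumes a standard normal initial density, and all names are hypothetical.

```python
import math


def planar_flow(z, u, w, b):
    """One planar step f(z) = z + u * tanh(w.z + b), with its log |det Jacobian|.

    Invertibility requires w.u >= -1; the caller is assumed to ensure this.
    """
    h = math.tanh(sum(wi * zi for wi, zi in zip(w, z)) + b)
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    # det(I + u psi^T) = 1 + u.psi, where psi = (1 - tanh^2) * w
    u_dot_psi = sum(ui * (1.0 - h * h) * wi for ui, wi in zip(u, w))
    return z_new, math.log(abs(1.0 + u_dot_psi))


def log_std_normal(z):
    """Log-density of a standard normal with independent components."""
    return sum(-0.5 * (zi * zi + math.log(2.0 * math.pi)) for zi in z)


def flow_log_density(z0, flows):
    """Push z0 through all flow steps and track log q(z_K)."""
    z, log_q = z0, log_std_normal(z0)
    for u, w, b in flows:
        z, log_det = planar_flow(z, u, w, b)
        log_q -= log_det  # log q(z_k) = log q(z_{k-1}) - log |det df_k/dz_{k-1}|
    return z, log_q
```

Each step costs a few inner products in the dimension of $z$, so stacking several steps stays cheap as long as $z$ is low dimensional.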
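Unfortunately, the divergence term in the ELBO defined in Equation 5 becomes generally intractable as the posterior $q_{\phi}(W)$ is an infinite mixture. This is addressed by introducing an auxiliary posterior distribution $r_{\theta}(z_K \mid W)$ parameterized by $\theta$ and using it to further lower bound the divergence term of Equation 5. Formally, $r_{\theta}(z_K \mid W)$ is parameterized by an inverse normalizing flow as follows:
$$\tilde{z}_0 = g_1^{-1} \circ \dots \circ g_K^{-1}(z_K), \qquad \log r_{\theta}(z_K \mid W) = \log r_{\theta}(\tilde{z}_0 \mid W) + \sum_{k=1}^{K} \log \left| \det \frac{\partial g_k^{-1}}{\partial \tilde{z}_k} \right| \tag{8}$$
with $\tilde{z}_K = z_K$ and $\tilde{z}_{k-1} = g_k^{-1}(\tilde{z}_k)$, and
$$r_{\theta}(\tilde{z}_0 \mid W) = \prod_{i=1}^{D_{in}} \mathcal{N}\left(\tilde{z}_{0,i}; \tilde{\mu}_i(W), \tilde{\sigma}_i^2(W)\right)$$
where we parameterize $g^{-1}$ as another normalizing flow. The parameters $\theta$ are the learnable auxiliary network parameters, which are composed of the parameters of $\tilde{\mu}$ and $\tilde{\sigma}$ and the parameters of the inverse normalizing flow $g^{-1}$. Finally, we obtain the lower bound that MNF should optimize by replacing the divergence term with the lower bound in terms of the distribution $r_{\theta}$:
$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_{\phi}(W, z_K)}\left[\log p(\mathcal{D} \mid W) + \log p(W) + \log r_{\theta}(z_K \mid W) - \log q_{\phi}(W \mid z_K) - \log q_{\phi}(z_K)\right] \tag{12}$$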
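We can now parameterize the random sampling from $q_{\phi}(W)$ in terms of noise variables $\epsilon_1$ and $\epsilon_2$, and a deterministic function $h$ by $W = h(\phi, \epsilon_1, \epsilon_2)$, as described by the following sampling procedure:
$$\epsilon_1 \sim \mathcal{N}(0, I_{D_{in}}), \quad z_0 = \mu_z + \sigma_z \odot \epsilon_1, \quad z_K = f_K \circ \dots \circ f_1(z_0), \quad \epsilon_2 \sim \mathcal{N}(0, I_{D_{in} D_{out}}), \quad W = \left(z_K \mathbf{1}_{D_{out}}^{\top}\right) \odot \mu + \sigma \odot \epsilon_2 \tag{13}$$
where $\odot$ is elementwise multiplication, $I_{D_{in}}$ and $I_{D_{in} D_{out}}$ are identity matrices (with $\epsilon_2$ reshaped into a $D_{in} \times D_{out}$ matrix), $\mu$ and $\sigma$ are the matrices with entries $\mu_{ij}$ and $\sigma_{ij}$, and $\mathbf{1}_{D_{out}}$ is a $D_{out}$-dimensional vector whose entries are all equal to one. The lower bound in Equation 12 can be written as:
$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{\epsilon_1, \epsilon_2}\left[\log p(\mathcal{D} \mid W) + \log p(W) + \log r_{\theta}(z_K \mid W) - \log q_{\phi}(W \mid z_K) - \log q_{\phi}(z_K)\right], \qquad W = h(\phi, \epsilon_1, \epsilon_2)$$
Thus, we can have a Monte Carlo sample of the gradient of $\mathcal{L}(\phi, \theta)$ with respect to $\phi$ and $\theta$. This parameterization allows us to handle the approximate parameter posterior as a straightforward optimization problem.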
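The sampling procedure for one fully connected layer can be sketched as follows: draw the low dimensional variable, push it through the flow, then scale the rows of the mean matrix and add independent Gaussian noise. The function and argument names are hypothetical, the flow is passed in as an arbitrary callable, and gradients are not tracked in this plain-Python illustration.

```python
import random


def sample_mnf_weights(mu_z, sigma_z, mean, std, flow, rng):
    """Draw one weight matrix from a multiplicative-noise posterior.

    mu_z, sigma_z: parameters of the initial Gaussian over z_0 (length D_in).
    mean, std: D_in x D_out matrices of weight means and standard deviations.
    flow: callable mapping z_0 to z_K (any invertible transformation).
    """
    d_in, d_out = len(mean), len(mean[0])
    # z_0 = mu_z + sigma_z * eps_1, with eps_1 ~ N(0, I)
    z0 = [mu_z[i] + sigma_z[i] * rng.gauss(0.0, 1.0) for i in range(d_in)]
    z_k = flow(z0)
    # w_ij = z_K[i] * mean_ij + std_ij * eps_2[i][j]: z rescales whole rows of the mean.
    weights = [
        [z_k[i] * mean[i][j] + std[i][j] * rng.gauss(0.0, 1.0) for j in range(d_out)]
        for i in range(d_in)
    ]
    return weights, z_k
```

Setting the flow to the identity, `mu_z` to ones and `sigma_z` to zeros fixes $z$ at one and leaves a fully factorized Gaussian over the weights, which is the mean field special case.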
4 Multiplicative Normalizing Flows for Randomized Value Functions
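We now turn to our novel proposed approach that uses the techniques we previously introduced. We model the distribution of the expected cumulative discounted reward given the initial state-action pair $(s, a)$ as a Gaussian distribution with parameterized mean $Q(s, a; W)$ and constant standard deviation $\tau$: $p(y \mid s, a, W) = \mathcal{N}\left(y; Q(s, a; W), \tau^2\right)$. (Footnote 1: We overload our notation $W$ for both the weight matrix of a single layer and the full set of network parameters.)
In this setting, DQN corresponds to a maximum log-likelihood estimation: $L(W)$ from Eq. 3 is:
$$L(W) = \mathbb{E}_{(s, a, r, s')}\left[\left(Q(s, a; W) - y\right)^2\right] \propto -\mathbb{E}_{(s, a, r, s')}\left[\log p(y \mid s, a, W)\right]$$
where $y = r + \gamma \max_{a'} Q(s', a'; W^{-})$ and we ignore constant terms. Instead of maintaining a maximum likelihood estimate of the Q-value function parameters, we use a randomized value function to track an approximate posterior distribution $q_{\phi}(W)$ of the network parameters, and use the MNF family to parameterize the approximate posterior. The weights $W$ are considered as random variables and are obtained by the sampling procedure described in Equation 13. The Q-value is now parameterized in terms of the approximate posterior defined in Section 3. Our approach optimizes the ELBO in Equation 12 with respect to the approximate posterior parameters $(\phi, \theta)$, which amounts to minimizing the following loss:
$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{\epsilon_1, \epsilon_2}\left[\frac{1}{2\tau^2} \sum_{(s, a, r, s') \in \mathcal{D}} \left(Q(s, a; W) - y\right)^2 + \mathcal{R}(\phi, \theta; W)\right], \qquad \mathcal{R}(\phi, \theta; W) = \log q_{\phi}(W \mid z_K) + \log q_{\phi}(z_K) - \log p(W) - \log r_{\theta}(z_K \mid W)$$
where $W = h(\phi, \epsilon_1, \epsilon_2)$, $y = r + \gamma \max_{a'} Q(s', a'; W^{-})$ and where the noise is disabled for the target network. This loss is amenable to mini-batch optimization. In a supervised learning setting, we would optimize the following cost for every mini-batch $\mathcal{B}_j$, $j = 1, \dots, M$:
$$\mathcal{L}_j(\phi, \theta) = \mathbb{E}_{\epsilon_1, \epsilon_2}\left[\frac{1}{2\tau^2} \sum_{(s, a, r, s') \in \mathcal{B}_j} \left(Q(s, a; W) - y\right)^2 + \lambda\, \mathcal{R}(\phi, \theta; W)\right] \tag{17}$$
where $\lambda = 1/M$ and $M$ is the number of mini-batches. This makes the regularization cost uniformly distributed among mini-batches at each epoch. In the RL setting, however, we only keep a moving window of experiences in the replay buffer. Thus, the size of the replay buffer is not directly analogous to the size of the dataset. As such, the weight $\lambda$ is left as a hyper-parameter to tune.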
where and are linear transformations. For the auxiliary posterior distribution defined in equation 8, we parameterize the mean and the standard deviation as in the original paper Louizos and Welling (2017).
We call the new agent MNF-DQN. For the update step, the agent's parameters are drawn from the approximate posterior; MNF-DQN samples the two noise variables $\epsilon_1$ and $\epsilon_2$ and sets $W = h(\phi, \epsilon_1, \epsilon_2)$ following the procedure in Equation 13. The current noise samples are held fixed across the mini-batch. MNF-DQN then updates its learnable parameters by performing a gradient descent step on the mini-batch loss in Equation 17. After updating, MNF-DQN re-samples its parameters from the new approximate posterior distribution and selects actions according to the greedy policy with respect to the sampled Q-value. The full algorithm can be seen in Algorithm 1.
5 Related work
There have been several recent works that incorporate Bayesian parameter updates with deep reinforcement learning for efficient exploration.
Osband et al. (2016a) propose BootstrappedDQN which consists of a simple non-parametric bootstrap with random initialization to approximate a distribution over Q-values. BootstrappedDQN consists of a network with multiple Q-heads. At the start of each episode, the agent samples a head, which it follows for the duration of the episode. BootstrappedDQN is a non-parametric approach to uncertainty estimation. In contrast, MNF-DQN uses a parametric approach, based on variational inference, to quantify the uncertainty estimates.
Azizzadenesheli et al. (2018) extend randomized least-square value iteration by Osband et al. (2016b) (which was restricted to linear approximator) to deep neural networks. In particular, they consider only the last layer as stochastic and keep the remaining layers deterministic. As the last layer is linear, they propose a Bayesian linear regression to update the posterior distribution of its weights in closed form. In contrast, our method is capable of performing an approximate Bayesian update on the full network parameters. Variational inference could be applied for all layers using stochastic gradient descent on the approximate posterior parameters.
The closest work to ours is BBQ-Networks by Lipton et al. (2016). Their algorithm, called Bayes-by-Backprop Q-Network (BBQN), uses variational inference to quantify uncertainty. It uses independent factorized Gaussians as an approximate posterior (Blundell et al., 2015). In our work, we argue that to achieve efficient exploration, we need to capture the true uncertainty of the Q-values. The latter depends crucially on the flexibility of the approximate posterior distribution. We also note that BBQN was proposed for Task-Oriented Dialogue Systems and was not evaluated on standard RL benchmarks. Furthermore, BBQN can be seen simply as a sub-case of MNF-DQN. In fact, the two algorithms are equivalent when we set the auxiliary variable in MNF-DQN to be equal to one.
Our work is also related to methods that inject noise in the parameter space for exploration. For such methods, at the beginning of each episode, the parameters of the current policy are perturbed with some noise. This results in a policy that is perturbed but still consistent for individual states within an episode. This is sometimes called state-dependent exploration (Sehnke et al., 2010) as the same action will be taken every time the same state is sampled in the episode. Recently, Fortunato et al. (2017) proposed to add parametric noise to the parameters of a neural network and showed that it aids exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. Concurrently, Plappert et al. (2017) proposed a similar approach, but they rely on heuristics to adapt the noise scale instead of learning it as in Fortunato et al. (2017).
6 Experiments
We evaluate the performance of MNF-DQN on two toy domains (N-chain and Mountain Car), as well as several Atari games (Bellemare et al., 2013). We compare the performance of MNF-DQN to several recent state-of-the-art deep exploration methods.
6.1 Toy Tasks
As a sanity check, we evaluate MNF-DQN on the well-known N-chain environment introduced in Osband et al. (2016a). The environment consists of $N$ states. The agent always starts at the second state $s_2$ and has two possible actions: move right and move left. A small reward is received in the first state $s_1$ and a large reward in the final state $s_N$; otherwise the reward is zero.
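A minimal version of such a chain is sketched below. The reward magnitudes (0.001 and 1.0) and the episode horizon of N + 9 steps are assumed placeholder values rather than settings taken from the text above, and the class name and interface are hypothetical.

```python
class NChain:
    """Chain of n states (0-indexed); action 0 moves left, action 1 moves right."""

    def __init__(self, n, small_reward=0.001, large_reward=1.0, horizon=None):
        self.n = n
        self.small_reward = small_reward  # assumed value, paid in the first state
        self.large_reward = large_reward  # assumed value, paid in the final state
        self.horizon = horizon if horizon is not None else n + 9  # assumed episode length
        self.state = 1
        self.t = 0

    def reset(self):
        self.state = 1  # the agent always starts in the second state
        self.t = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state = min(self.state + 1, self.n - 1)
        else:
            self.state = max(self.state - 1, 0)
        self.t += 1
        if self.state == 0:
            reward = self.small_reward
        elif self.state == self.n - 1:
            reward = self.large_reward
        else:
            reward = 0.0
        return self.state, reward, self.t >= self.horizon
```

The small reward sits one step to the left of the start state, whereas the large reward needs a long run of consecutive right moves; this asymmetry is what makes undirected exploration struggle as the chain grows.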
We compare the exploration behavior of MNF-DQN, NoisyDQN (Fortunato et al., 2017), BBQN (Lipton et al., 2016) and $\epsilon$-greedy DQN on varying chain lengths. We train each agent with ten different random seeds for each chain length. After each episode, agents are evaluated on a single roll-out with all of their randomness disabled ($\epsilon$ is set to zero for DQN, noise variables are set to zero for MNF-DQN, BBQN and NoisyDQN). The problem is considered solved when the agent completes the task optimally for one hundred consecutive episodes. While the task is admittedly a simple one, it still requires adequate exploration in order to be solved. This is especially true with large chain lengths, as it is easy to discover the small reward and fall into premature exploitation. Figure 1 shows that MNF-DQN has very consistent performance across different chain lengths. MNF-DQN clearly outperforms $\epsilon$-greedy DQN, which fails completely on the longer chains. BBQN performs well but slightly worse than MNF-DQN for very large chain lengths. MNF-DQN also outperforms NoisyDQN, which on average needs a larger number of episodes to solve the task.
Mountain Car (Moore, 1990) is a classic continuous-state RL task where the agent (car) is initially stuck in a valley and the goal is to drive up the mountain on the right. The only way to succeed is to drive back and forth to build up momentum. We use the implementation provided by OpenAI gym (Brockman et al., 2016), where the agent receives a fixed reward at every time-step and a separate reward when it reaches the top of the mountain, at which point the episode ends. The maximum length of an episode is set to 1000 time-steps. We evaluate the performance of the agent every 10 episodes by using no noise over 5 runs with different random seeds. As shown in Figure 6.1, MNF-DQN learns much faster and performs better when compared to the other exploration strategies.
6.2 Arcade Learning Environment
Next, we consider a set of Atari games (Bellemare et al., 2013) as a benchmark for high dimensional state-spaces. We compare MNF-DQN to standard DQN with $\epsilon$-greedy exploration, BBQN and NoisyDQN. We use the same network architecture for all agents, i.e., three convolutional layers and two fully connected layers. For DQN, we linearly anneal $\epsilon$ from 1.0 to 0.1 over the first 1 million time-steps. For NoisyDQN, the fully connected layers are parameterized as noisy layers and use factorized Gaussian noise as explained in Fortunato et al. (2017). For MNF-DQN and BBQN, in order to reduce computational overhead, we choose to consider only the parameters of the fully connected layers as stochastic variables and perform variational inference on them. We consider the parameters of the convolutional layers as deterministic variables and optimize them using maximum log-likelihood. For MNF-DQN, the normalizing flows are of length two for $q_{\phi}$ and $r_{\theta}$, with 50 hidden units for each step of the flow. To have a fair comparison across all algorithms, we fill the replay buffer with actions selected at random for the first 50 thousand time-steps.
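The linear annealing schedule described for the DQN baseline amounts to a clipped linear interpolation. A minimal sketch, with a hypothetical function name and the endpoints quoted above as defaults:

```python
def linear_epsilon(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly interpolate epsilon from start to end, then hold it at end."""
    fraction = min(max(step / anneal_steps, 0.0), 1.0)
    return start + fraction * (end - start)
```

Halfway through the annealing period, at 500,000 steps, this gives an exploration rate of 0.55, and any step count past one million returns 0.1.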
We use the standard hyper-parameters of DQN for all agents. MNF-DQN and BBQN have an extra hyper-parameter $\lambda$, the trade-off parameter between the log-likelihood cost and the regularization cost. To tune $\lambda$, we run MNF-DQN and BBQN over a set of candidate values. We train each agent for 40 million frames. We evaluate each agent on the return collected by the exploratory policy during training steps. Each agent is trained for 5 different random seeds. We plot in Figure 3 the median return as well as the interquartile range.
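For readers reproducing this kind of plot, the per-timestep summary across seeds can be computed as below. This is a sketch with a hypothetical function name; the quartile convention is the default of Python's `statistics.quantiles`, which is an assumption and may differ from the convention behind Figure 3.

```python
import statistics


def median_and_iqr(returns_across_seeds):
    """Median and (first quartile, third quartile) of returns from several seeds."""
    median = statistics.median(returns_across_seeds)
    q1, _, q3 = statistics.quantiles(returns_across_seeds, n=4)
    return median, (q1, q3)
```

With only five seeds the quartiles are interpolated between neighbouring runs, so the shaded band should be read as a rough indication of spread rather than a precise interval.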
From Figure 3 we see that across all games our approach provides competitive and consistent results. Moreover, the naive epsilon-greedy approach (DDQN) performs significantly worse than the exploration-based methods in most cases. Of note, MNF provides a boost in performance over the baselines in Gravitar, which is considered a hard exploration and sparse reward game (Bellemare et al., 2016). Meanwhile, BBQN fails completely in the same game. In all the other hard exploration games (Amidar, Alien, Bank Heist, and Qbert), the difference between MNF and the best performing baseline is minimal (within the margin of error).
7 Conclusion
Through the combination of multiplicative normalizing flows and modern value based deep reinforcement learning methods, we show that a powerful approximate posterior can be efficiently utilized for better exploration. Moreover, the improved sample efficiency comes only at a computational cost that is linear in the number of model parameters. Finally, we find that on several common Deep RL benchmarks, the MNF approximation outperforms state-of-the-art exploration baselines.
References
- Azizzadenesheli et al. (2018) Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. (2018). Efficient exploration through bayesian deep q-networks. arXiv preprint arXiv:1802.04412.
- Bellemare et al. (2016) Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems.
- Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research.
- Blundell et al. (2015) Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622.
- Brafman and Tennenholtz (2002) Brafman, R. I. and Tennenholtz, M. (2002). R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research.
- Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
- Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using real nvp. arXiv preprint arXiv:1605.08803.
- Fortunato et al. (2017) Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., et al. (2017). Noisy networks for exploration. arXiv preprint arXiv:1706.10295.
- Gordon (1999) Gordon, G. J. (1999). Approximate solutions to markov decision processes. Robotics Institute, page 228.
- Hinton and van Camp (1993) Hinton, G. and van Camp, D. (1993). Keeping neural networks simple by minimising the description length of weights. In Proceedings of COLT-93.
- Houthooft et al. (2016) Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems.
- Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research.
- Kearns and Singh (2002) Kearns, M. and Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine learning.
- Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems.
- Kingma et al. (2015) Kingma, D. P., Salimans, T., and Welling, M. (2015). Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583.
- Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Lipton et al. (2016) Lipton, Z. C., Gao, J., Li, L., Li, X., Ahmed, F., and Deng, L. (2016). Efficient exploration for dialogue policy learning with bbq networks & replay buffer spiking. arXiv preprint arXiv:1608.05081.
- Louizos and Welling (2017) Louizos, C. and Welling, M. (2017). Multiplicative normalizing flows for variational bayesian neural networks. In International Conference on Machine Learning, pages 2218–2227.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.
- Moore (1990) Moore, A. W. (1990). Efficient memory-based learning for robot control.
- Osband et al. (2016a) Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. (2016a). Deep exploration via bootstrapped dqn. In Advances in neural information processing systems.
- Osband et al. (2013) Osband, I., Russo, D., and Van Roy, B. (2013). (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems.
- Osband et al. (2017) Osband, I., Russo, D., Wen, Z., and Van Roy, B. (2017). Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608.
- Osband et al. (2016b) Osband, I., Van Roy, B., and Wen, Z. (2016b). Generalization and exploration via randomized value functions. In International Conference on Machine Learning.
- Plappert et al. (2017) Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. (2017). Parameter space noise for exploration. arXiv preprint arXiv:1706.01905.
- Ranganath et al. (2016) Ranganath, R., Tran, D., and Blei, D. (2016). Hierarchical variational models. In International Conference on Machine Learning, pages 324–333.
- Rezende and Mohamed (2015) Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning.
- Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286.
- Riedmiller (2005) Riedmiller, M. (2005). Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer.
- Salimans et al. (2015) Salimans, T., Kingma, D., and Welling, M. (2015). Markov chain monte carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226.
- Sehnke et al. (2010) Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., and Schmidhuber, J. (2010). Parameter-exploring policy gradients. Neural Networks, 23(4):551–559.
- Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
- Stadie et al. (2015) Stadie, B. C., Levine, S., and Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814.
- Strehl and Littman (2005) Strehl, A. L. and Littman, M. L. (2005). A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd international conference on Machine learning. ACM.
- Strens (2000) Strens, M. (2000). A bayesian framework for reinforcement learning. In ICML.
- Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Introduction to reinforcement learning, volume 135. MIT Press Cambridge.
- Thompson (1933) Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika.
- van Hasselt et al. (2016) van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double q-learning. AAAI Conference on Artificial Intelligence.