Nonparametric Gaussian mixture models for the multiarmed contextual bandit
Abstract
The multiarmed bandit is a sequential allocation task where an agent must learn a policy that maximizes long term payoff, where only the reward of the played arm is observed at each iteration. In the stochastic setting, the reward for each action is generated from an unknown distribution, which depends on a given ‘context’, available at each interaction with the world. Thompson sampling is a generative, interpretable multiarmed bandit algorithm that has been shown both to perform well in practice, and to enjoy optimality properties for certain reward functions. Nevertheless, Thompson sampling requires sampling from parameter posteriors and calculation of expected rewards, which are possible for a very limited choice of distributions. We here extend Thompson sampling to more complex scenarios by adopting a very flexible set of reward distributions: nonparametric Gaussian mixture models. The generative process of Bayesian nonparametric mixtures naturally aligns with the Bayesian modeling of multiarmed bandits. This allows for the implementation of an efficient and flexible Thompson sampling algorithm: the nonparametric model autonomously determines its complexity in an online fashion, as it observes new rewards for the played arms. We show how the proposed method sequentially learns the nonparametric mixture model that best approximates the true underlying reward distribution. Our contribution is valuable for practical scenarios, as it avoids stringent model specifications, and yet attains reduced regret.
1 Introduction
Recent advances in reinforcement learning (13) have sparked renewed interest in sequential decision making. The aim of sequential decision making is to optimize interactions with the world (exploit) while simultaneously learning how the world operates (explore). Its origins can be traced back to the beginning of the past century, with important contributions within the field of statistics by Thompson (40) and later Robbins (28). The multiarmed bandit (MAB) problem is a natural abstraction for a wide variety of real-world challenges that require learning while simultaneously maximizing rewards.
The name “bandit” finds its origin in the playing strategy one must devise when facing a row of slot machines. The contextual setting, where at each interaction with the world side information (known as ‘context’) is available, is a natural extension of the bandit problem. Recently, a renaissance of the study of MAB problems has flourished (33, 2, 22). The performance of algorithms for contextual bandits with linear payoffs (29, 10) has been widely studied in the last decade (18, 7, 1). Furthermore, it has attracted interest from industry, due to its impact in digital advertising and products (20, 8).
Thompson sampling (39, 32), together with its generalization known as posterior sampling, provides an elegant approach to the exploration-exploitation dilemma. It updates a posterior over expected rewards, and chooses actions based on the probability that they are optimal. It has been shown, both empirically and theoretically, to perform competitively for many MAB models (8, 34, 3, 4, 16, 30, 31). In addition, its applicability to the more general reinforcement learning setting of Markov Decision Processes (6) has recently gained momentum (12, 25).
Thompson sampling and the Bayesian modeling of the MAB problem facilitate not only generative and interpretable modeling, but sequential and batch processing algorithms as well. Within this framework, one only requires access to posterior samples of the model. Unfortunately, maintaining such a posterior is intractable for distributions not in the exponential family (16). As such, developing practical MAB methods to balance exploration and exploitation in complex domains remains largely unsolved.
In an effort to extend Thompson sampling to more complex scenarios, researchers have considered other flexible reward functions and Bayesian inference. For example, recent approaches have embraced approximate Bayesian neural networks for Thompson sampling. Neural networks have proven to be powerful function approximators, and approximate Bayesian inference provides posterior uncertainty estimates. To that end, variational methods, stochastic minibatches, and Monte Carlo techniques have been studied for uncertainty estimation of posteriors (5, 15, 21, 24, 19). In a recent benchmark of such techniques (27), it was reported that, even if successful in the supervised learning setting, they underperform in the MAB scenario. In particular, Riquelme et al. (27) emphasize the difficulty of adapting the slowly converging uncertainty estimates of neural-network-based methods to the bandit setting.
In parallel, others have focused on extending Thompson sampling by targeting alternative classes of reward functions. Some have focused on approximating the unknown bandit reward functions with Gaussian mixture models (42), while others have assumed a Gaussian process reward distribution (36, 14, 17). The latter are powerful nonparametric methods for modeling distributions over nonlinear continuous functions (26). Unfortunately, standard Gaussian processes are computationally demanding, as they scale cubically in the number of observations, limiting their applicability to small datasets and the online setting (even if advancements such as pseudo-observations (35) or variational inference (41) have been proposed to mitigate these shortcomings).
In this paper, we combine the large hypothesis space of mixture models (which can approximate any continuous reward distribution) with the flexibility of Bayesian nonparametrics (11). In many contexts, a countably infinite mixture is a very realistic model to assume, and has been shown to succeed in modeling a diversity of phenomena. Within the Bayesian framework, one uses prior distributions over the mixing proportions, such as Dirichlet or Pitman-Yor processes (37), which allow for inference of the appropriate complexity of a model from observed data. In these models, one does not explicitly specify the number of mixture components; instead, an unbounded number of components is allowed. Bayesian nonparametrics support a wide class of models, yet have analytically tractable inference and online update rules.
Our contribution here is on exploiting Bayesian nonparametric mixture models for Thompson sampling to perform MAB optimization. This provides a new flexible framework for solving a rich class of MAB problems. We model the complex mapping between the observed rewards and the unknown parameters of the generating process with nonparametric Gaussian mixture models. For learning such a nonparametric distribution within the contextual multiarmed bandit setting, we leverage the advances in Markov Chain Monte Carlo methods for Bayesian nonparametric models (23).
Mixtures of distributions provide a powerful approach for nonparametric density estimation, and the generative interpretation of Bayesian nonparametric models corresponds to the sequential nature of the MAB problem as well. The proposed method learns the nonparametric mixture model that best approximates the true underlying reward distribution, adjusting its complexity as it sequentially observes additional data. To the best of our knowledge, no other work uses Bayesian nonparametric mixture models to address the contextual MAB.
2 Background
2.1 Multiarmed bandits
A multiarmed bandit is a real-time sequential decision process in which, at each iteration, an agent is asked to select an action according to a policy that maximizes the accumulated rewards over time, balancing exploitation and exploration. In the contextual case, one must decide which arm $a_t$ to play next (i.e., pick $a_t \in \mathcal{A}$), based on the available context, e.g., $x_t \in \mathbb{R}^{d_X}$. At every iteration $t$, the observed reward $y_t$ is independently drawn from the unknown reward distribution corresponding to the played arm, conditioned on the context and parameterized by unknown $\theta$; i.e., $y_t \sim p_a(Y \mid x_t, \theta)$. Due to the stochastic nature of the bandit, one summarizes each arm’s reward via its conditional expectation for that context, $\mu_a(x_t, \theta) = \mathbb{E}_a\{Y \mid x_t, \theta\}$.
When the properties of the arms (i.e., the parameters $\theta$) are known, one can readily determine the optimal selection policy as soon as the context is given, i.e.,
(1) $a_t^*(x_t, \theta) = \arg\max_{a \in \mathcal{A}} \mu_a(x_t, \theta)$
The challenge in the contextual MAB problem is not knowing the true reward parameters or, more generally, the lack of knowledge about the reward-generating model. Thus, one needs to simultaneously (1) learn the properties of the reward distribution and (2) decide which action to take sequentially. The next arm to play is chosen based upon the history observed, with the goal of maximizing the expected (cumulative) reward. Previous history contains the set of given contexts, played arms, and observed rewards up to time $t$, denoted as $\mathcal{H}_{1:t} = \{x_{1:t}, a_{1:t}, y_{1:t}\}$, with $x_{1:t} = (x_1, \dots, x_t)$, $a_{1:t} = (a_1, \dots, a_t)$ and $y_{1:t} = (y_1, \dots, y_t)$.
Among the many alternatives to address this class of problems, Thompson sampling is particularly appealing, due to its generative formulation and its connection with Bayesian modeling. Furthermore, it has been shown to perform well empirically and has sound theoretical bounds, for both contextual and context-free problems (8, 3, 4, 16, 30, 31).
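Thompson sampling chooses which arm to play in proportion to its probability of being optimal, i.e.,
(2) $a_{t+1} \sim \Pr\left[a = a_{t+1}^* \mid x_{t+1}, \mathcal{H}_{1:t}\right]$
where $a_{t+1}^*$ is the optimal arm given the true parameters and the observed context, i.e., $a_{t+1}^* = \arg\max_{a \in \mathcal{A}} \mu_a(x_{t+1}, \theta)$. If the parameters of the model are known, the above expression becomes deterministic, as one always picks the arm with the maximum expected reward
(3) $\Pr\left[a = a_{t+1}^* \mid x_{t+1}, \mathcal{H}_{1:t}, \theta\right] = \Pr\left[a = a_{t+1}^* \mid x_{t+1}, \theta\right] = I_a(x_{t+1}, \theta)$
where $I_a(x_{t+1}, \theta)$ denotes the indicator function
(4) $I_a(x_{t+1}, \theta) = \begin{cases} 1, & a = a_{t+1}^*(x_{t+1}, \theta) \\ 0, & \text{otherwise} \end{cases}$
When the parameters of the model are unknown, one needs to explore ways of computing the probability of each arm being optimal. In a Bayesian setting, the parameters are modeled as random variables with priors. Specifically, one marginalizes over the posterior probability distribution of the parameters, after observing rewards and actions up to time instant $t$, i.e.,
(5) $\Pr\left[a = a_{t+1}^* \mid x_{t+1}, \mathcal{H}_{1:t}\right] = \int \Pr\left[a = a_{t+1}^* \mid x_{t+1}, \theta\right] f(\theta \mid \mathcal{H}_{1:t}) \, d\theta = \int I_a(x_{t+1}, \theta) \, f(\theta \mid \mathcal{H}_{1:t}) \, d\theta$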
The above integral cannot be solved exactly, even when the parameter posterior update is analytically tractable. Therefore, when reward distributions that are not within the exponential family are considered, one must resort to approximations of the posterior. In the following, we propose nonparametric mixture models as tractable yet performant reward distributions for the MAB.
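As an illustration of the marginalization over the parameter posterior, the probability of each arm being optimal can be approximated by Monte Carlo: draw parameter samples from the posterior and count how often each arm attains the maximum expected reward. The following is a minimal sketch under a hypothetical toy model (a scalar context, and an independent Gaussian posterior over a single weight per arm); the function name `prob_optimal` and all numerical values are assumptions made for illustration, not part of the proposed method.

```python
import random


def prob_optimal(post_means, post_stds, x, n_samples=5000, seed=0):
    """Monte Carlo estimate of Pr[arm a is optimal | context x].

    Toy assumption: arm a has expected reward w_a * x, and the posterior
    of each scalar weight w_a is Gaussian N(post_means[a], post_stds[a]^2).
    """
    rng = random.Random(seed)
    n_arms = len(post_means)
    wins = [0] * n_arms
    for _ in range(n_samples):
        # One posterior parameter sample per arm, mapped to expected rewards.
        mu = [rng.gauss(m, s) * x for m, s in zip(post_means, post_stds)]
        # Indicator of optimality: which arm maximizes the expected reward?
        best = max(range(n_arms), key=lambda a: mu[a])
        wins[best] += 1
    return [w / n_samples for w in wins]


probs = prob_optimal(post_means=[0.2, 0.5, 0.45], post_stds=[0.1, 0.1, 0.3], x=1.0)
print(probs)
```

Thompson sampling avoids computing these probabilities explicitly: playing the maximizing arm for a single posterior sample selects each arm with exactly this probability.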
2.2 Bayesian nonparametric mixture models
Bayesian nonparametric models provide a powerful density estimation framework that adjusts model complexity in response to the data observed. The combination of mixture models with Bayesian nonparametrics embodies a large hypothesis space, which can approximate continuous reward distributions arbitrarily well. Bayesian nonparametric mixture models describe countably infinite mixture distributions, which are very flexible assumptions suited for many practical settings. We refer to (11) for a detailed review of standard nonparametric models and how they can be used in practice.
A variety of Bayesian nonparametric alternatives have been studied in the literature. We here focus on the Pitman-Yor model, which is a stochastic process whose sample path is a probability distribution. It generalizes the Dirichlet process: a random sample drawn from it is an infinite discrete probability distribution. In the following, we succinctly summarize the generative process and the basics of its inference.
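A Pitman-Yor mixture model (37), with a discount parameter $0 \le d < 1$ and a concentration parameter $\gamma > -d$, is described by the following generative process:
1. Mixture parameters $\varphi_n$ are drawn from the Pitman-Yor process, i.e., $G \sim \mathrm{PY}(d, \gamma, G_0)$, $\varphi_n \mid G \sim G$, where $G_0$ is the base measure. Equivalently, the process can be described as
(6) $\varphi_{n+1} \mid \varphi_{1:n}, d, \gamma, G_0 \sim \sum_{k=1}^{K} \frac{n_k - d}{n + \gamma} \, \delta_{\varphi_k} + \frac{\gamma + K d}{n + \gamma} \, G_0$
where $n$ refers to the number of available observations, $n_k$ to the number of observations assigned to mixture $k$, and $K$ to the number of mixtures already drawn.
2. The observation $y_n$ is drawn from the emission distribution parameterized by its corresponding parameters, i.e., $y_n \sim p(Y \mid \varphi_n)$.
For parametric measures, we write $G_0(\varphi) = G(\varphi \mid \Phi_0)$ and $G_n(\varphi) = G(\varphi \mid \Phi_n)$, where $\Phi_0$ are the prior hyperparameters of the emission distribution, and $\Phi_n$ are the posterior parameters after $n$ observations.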
We note that the Dirichlet process can be readily obtained from Eqn. (6) by setting the discount parameter to zero, $d = 0$. The discount parameter gives the Pitman-Yor process more flexibility over tail behavior (the Dirichlet process has exponential tails, whereas the Pitman-Yor process can have power-law tails).
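The sequential form of the generative process can be sketched as the standard Pitman-Yor "Chinese restaurant" rule: an observation joins an existing mixture $k$ with probability proportional to its count minus the discount, or opens a new mixture with probability proportional to the concentration plus the discount times the number of mixtures. The sketch below only simulates mixture assignments (not emissions); the function names are hypothetical, and it assumes `gamma > 0` so that the very first draw is well defined.

```python
import random


def py_predictive(counts, d, gamma):
    """Probabilities of joining each existing mixture, then a new one (last entry)."""
    n, n_mixtures = sum(counts), len(counts)
    probs = [(n_k - d) / (n + gamma) for n_k in counts]
    probs.append((gamma + n_mixtures * d) / (n + gamma))
    return probs


def sample_assignments(n_obs, d, gamma, seed=0):
    """Sequentially draw mixture assignments for n_obs observations."""
    rng = random.Random(seed)
    counts, z = [], []
    for _ in range(n_obs):
        probs = py_predictive(counts, d, gamma)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)   # a new, previously unseen mixture
        else:
            counts[k] += 1     # an already drawn mixture
        z.append(k)
    return z, counts


z, counts = sample_assignments(n_obs=200, d=0.3, gamma=1.0)
print(len(counts), counts)
```

With `d=0.0` the same code reduces to the Dirichlet process rule, where existing mixtures are weighted by their raw counts.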
For analysis and inference of these models, one incorporates auxiliary mixture variables . These are categorical variables, where , if observation is drawn from mixture . The joint posterior of these assignments follows, for ,
(7) 
where indicates the number of observations drawn from mixture and . The full joint likelihood of assignments and observations is
(8) 
For inference of the above model given observations , one can derive a Gibbs sampler that iterates between mixture assignment sampling and posterior updates of the emission distribution parameters (Teh and Jordan (37) provide a detailed explanation of the procedure).
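The conditional distributions of observation assignments to already drawn mixtures $k \in \{1, \dots, K\}$, and a new unseen mixture $k_{new}$, are
(9) $p(z_{n+1} = k \mid y_{n+1}, y_{1:n}, z_{1:n}, \gamma, d) \propto \frac{n_k - d}{n + \gamma} \int p(y_{n+1} \mid \varphi) \, G(\varphi \mid \Phi_{n_k}) \, d\varphi$, and $p(z_{n+1} = k_{new} \mid y_{n+1}, y_{1:n}, z_{1:n}, \gamma, d) \propto \frac{\gamma + K d}{n + \gamma} \int p(y_{n+1} \mid \varphi) \, G(\varphi \mid \Phi_0) \, d\varphi$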
Given these mixture assignments, one updates the parameter posteriors conditioned on and observations , based on the specific choices of emission distribution and priors: . These also determine the computation of the predictive distribution for solving Eqn. (9). For analytical convenience, one usually resorts to emission distributions with their conjugate priors.
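A single assignment step of such a Gibbs sampler can be sketched as follows: the Pitman-Yor weights of each candidate mixture are multiplied by the predictive likelihood of the observation under that mixture, and the result is normalized. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, the predictive log-likelihoods are taken as given inputs, and `counts` is assumed to exclude the observation being reassigned.

```python
import math


def assignment_posterior(counts, d, gamma, loglik_existing, loglik_new):
    """Normalized assignment probabilities: existing mixtures first, new mixture last.

    counts[k]          -- observations currently assigned to mixture k
    loglik_existing[k] -- predictive log-likelihood of the observation under mixture k
    loglik_new         -- predictive log-likelihood under the prior (new mixture)
    """
    n_mixtures = len(counts)
    # The common denominator (n + gamma) cancels after normalization.
    log_w = [math.log(n_k - d) + ll for n_k, ll in zip(counts, loglik_existing)]
    log_w.append(math.log(gamma + n_mixtures * d) + loglik_new)
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


probs = assignment_posterior(
    counts=[5, 2], d=0.1, gamma=0.5,
    loglik_existing=[-1.0, -6.0], loglik_new=-4.0,
)
print(probs)
```

Working in log space keeps the normalization stable when the predictive densities differ by many orders of magnitude.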
3 Proposed method
We now describe how to combine Bayesian nonparametric mixture models with Thompson sampling for the MAB setting. The graphical model of the Bayesian nonparametric MAB is rendered in Fig. 1. We consider a completely independent set of nonparametric mixture models per arm, with their own discount and concentration hyperparameters $d_a$ and $\gamma_a$.
As shown in Fig. 1, we assume complete independence of per-arm reward distributions, i.e., each arm of the bandit is allowed to follow a different family of distributions. We consider this setting to be a very powerful extension of the MAB problem, which has not attracted much interest so far.
An alternative would be to consider a hierarchical nonparametric model (38, 37), where all arms are assumed to obey the same family of distributions, but only their mixture proportions are allowed to vary across arms. The main advantage of this alternative is that one would learn parameter posteriors from rewards of all played arms, with the disadvantage of all arms being limited to the same family of reward distributions. We illustrate this alternative hierarchical nonparametric MAB, and provide details of the model and its inference, in Appendix A.
In order to approximate any continuous reward distribution, we study nonparametric Gaussian mixtures as a flexible formulation for modeling complex MAB reward densities.
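We focus on context-conditional Gaussian emission distributions $p_{a,k}(Y \mid x_t, \varphi_{a,k}) = \mathcal{N}(Y \mid x_t^\top w_{a,k}, \sigma_{a,k}^2)$, which are parameterized per-arm and per-mixture; i.e., $\varphi_{a,k} = (w_{a,k}, \sigma_{a,k}^2)$. The conjugate prior for such an emission distribution is a Normal-inverse-Gamma, with hyperparameters $\Phi_{a,0} = (U_{a,0}, V_{a,0}, \alpha_{a,0}, \beta_{a,0})$, i.e.,
(10) $G_{a,0}(\varphi_{a,k} \mid \Phi_{a,0}) = \mathcal{N}\left(w_{a,k} \mid U_{a,0}, \sigma_{a,k}^2 V_{a,0}\right) \cdot \Gamma^{-1}\left(\sigma_{a,k}^2 \mid \alpha_{a,0}, \beta_{a,0}\right)$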
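After observing rewards $y_{1:t}$, and conditioned on assignments $z_{1:t}$, the posteriors of the parameters per arm and mixture also follow a Normal-inverse-Gamma distribution, with hyperparameters $\Phi_{a,k,t} = (U_{a,k,t}, V_{a,k,t}, \alpha_{a,k,t}, \beta_{a,k,t})$.
The updated hyperparameters of such a posterior depend on the number $m_{a,k}$ of rewards observed after playing arm $a$ that are assigned to mixture $k$:
(11) $\begin{aligned} V_{a,k,t}^{-1} &= x_{1:t} R_{a,k} x_{1:t}^\top + V_{a,0}^{-1}, \\ U_{a,k,t} &= V_{a,k,t} \left( x_{1:t} R_{a,k} y_{1:t} + V_{a,0}^{-1} U_{a,0} \right), \\ \alpha_{a,k,t} &= \alpha_{a,0} + \tfrac{1}{2} m_{a,k}, \\ \beta_{a,k,t} &= \beta_{a,0} + \tfrac{1}{2} \left( y_{1:t}^\top R_{a,k} y_{1:t} + U_{a,0}^\top V_{a,0}^{-1} U_{a,0} - U_{a,k,t}^\top V_{a,k,t}^{-1} U_{a,k,t} \right) \end{aligned}$
where $R_{a,k} \in \mathbb{R}^{t \times t}$ is a sparse diagonal matrix with elements $[R_{a,k}]_{t',t'} = \mathbb{1}[a_{t'} = a, z_{t'} = k]$, and $m_a = \sum_k m_{a,k}$ the number of rewards observed after playing arm $a$.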
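For intuition, the conjugate update can be sketched for the special case of a scalar context, where all matrix operations reduce to scalar arithmetic. The code below uses the standard Normal-inverse-Gamma update for Bayesian linear regression, applied to the rewards of one arm that are assigned to one mixture; the function name, the default prior values, and the restriction to a scalar context are assumptions made for illustration.

```python
def nig_update(xs, ys, U0=0.0, V0=1.0, alpha0=1.0, beta0=1.0):
    """Normal-inverse-Gamma posterior for y = w * x + noise, with scalar context x.

    xs, ys -- contexts and rewards of one arm that are assigned to one mixture.
    Returns the posterior hyperparameters (U, V, alpha, beta).
    """
    m = len(ys)                                        # rewards assigned to this mixture
    V_inv = 1.0 / V0 + sum(x * x for x in xs)          # posterior precision factor
    V = 1.0 / V_inv
    U = V * (U0 / V0 + sum(x * y for x, y in zip(xs, ys)))  # posterior mean of w
    alpha = alpha0 + m / 2.0
    beta = beta0 + 0.5 * (sum(y * y for y in ys) + U0 * U0 / V0 - U * U * V_inv)
    return U, V, alpha, beta


# Noise-free toy data generated with w = 2: the posterior mean moves from 0 toward 2.
xs = [0.1 * i for i in range(1, 21)]
ys = [2.0 * x for x in xs]
U, V, alpha, beta = nig_update(xs, ys)
print(U, V, alpha, beta)
```

Because only sums over the assigned observations are needed, the update can be maintained incrementally as new rewards arrive or as the Gibbs sampler moves observations between mixtures.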
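Finally, the predictive emission distribution after marginalization of the parameters $\varphi_{a,k}$, needed for solving Eqn. (9), follows a conditional Student-t distribution
(12) $p_{a,k}(Y \mid x, \Phi_{a,k}) = \mathcal{T}\left(Y \mid \nu = 2\alpha_{a,k}, \; x^\top U_{a,k}, \; \frac{\beta_{a,k}}{\alpha_{a,k}} \left(1 + x^\top V_{a,k} x\right)\right)$, with $\nu$ degrees of freedom, location $x^\top U_{a,k}$ and squared scale $\frac{\beta_{a,k}}{\alpha_{a,k}}(1 + x^\top V_{a,k} x)$.
The hyperparameters used above are those of the prior ($\Phi_{a,0}$) or the posterior ($\Phi_{a,k,t}$), depending on whether the predictive density refers to a new mixture $k_{new}$, or a “seen” mixture $k$ to which observations have already been assigned, respectively.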
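The predictive density can be evaluated in closed form. The following minimal sketch, again restricted to a scalar context for brevity, computes the location-scale Student-t density that results from marginalizing a Normal-inverse-Gamma distribution over the Gaussian parameters (degrees of freedom $2\alpha$, location $xU$, squared scale $\frac{\beta}{\alpha}(1 + x^2 V)$). The function names and the example hyperparameters are hypothetical.

```python
import math


def student_t_pdf(y, nu, loc, scale):
    """Density of a location-scale Student-t distribution with nu degrees of freedom."""
    z = (y - loc) / scale
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi) - math.log(scale))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(z * z / nu))


def predictive_pdf(y, x, U, V, alpha, beta):
    """Predictive density of reward y at scalar context x under NIG(U, V, alpha, beta)."""
    nu = 2.0 * alpha
    loc = x * U
    scale = math.sqrt(beta / alpha * (1.0 + x * x * V))
    return student_t_pdf(y, nu, loc, scale)


# Same function for a "seen" mixture (posterior hyperparameters) or a new one (prior).
print(predictive_pdf(1.0, x=0.8, U=1.5, V=0.5, alpha=3.0, beta=2.0))
```

Passing the prior hyperparameters yields the density for a new mixture, while passing the updated ones yields the density for a mixture that already has rewards assigned to it.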
Similarly, the likelihood of a set of rewards assigned to a perarm mixture , , given their associated contexts , follows the matrix tdistribution
(13) 
3.1 Thompson sampling for nonparametric Gaussian mixture models
We now describe our proposed Thompson sampling technique for multiarmed contextual bandits with nonparametric Gaussian mixture reward models. To that end, we leverage the Bayesian generative process described above, and infer the posteriors over the parameters, in order to implement a posterior sampling based policy (30).
In the MAB problem, the agent needs to decide which arm to play next, based on the information available at that iteration. In a randomized probability matching technique, each arm is picked based on its probability of being optimal. However, since the integral in Eqn. (5) is intractable, Thompson sampling (40) draws a random parameter sample from the posterior instead, and picks the action that maximizes the expected reward, given that parameter sample. That is,
(14) $a_{t+1} = \arg\max_{a \in \mathcal{A}} \mu_a\left(x_{t+1}, \theta^{(t+1)}\right)$, with $\theta^{(t+1)} \sim f(\theta \mid \mathcal{H}_{1:t})$
In the proposed model, we sample per-arm and per-mixture Gaussian parameters from their posterior distributions with updated hyperparameters $\Phi_{a,k,t}$, conditioned on the mixture assignments determined by the Gibbs sampler in Eqn. (9). The Gibbs sampler is run until a stopping criterion is met (i.e., the model likelihood of the sampled MCMC chain is stable within an $\epsilon$ margin between steps, or a maximum number of iterations is reached).
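The stopping rule can be sketched as a small driver loop that stops once the relative change in log-likelihood between consecutive sweeps falls below a tolerance, or once an iteration cap is hit. In the sketch below, `sweep` and `log_likelihood` are hypothetical callables standing in for one full Gibbs pass and for the model log-likelihood; the toy functions at the end are placeholders used only to exercise the loop.

```python
def run_gibbs(sweep, log_likelihood, state, eps=0.01, max_iter=100):
    """Repeat Gibbs sweeps until the log-likelihood is stable within a relative margin."""
    prev = log_likelihood(state)
    for iteration in range(1, max_iter + 1):
        state = sweep(state)
        cur = log_likelihood(state)
        # Relative difference between consecutive steps.
        if abs(cur - prev) <= eps * abs(prev):
            return state, iteration
        prev = cur
    return state, max_iter


# Toy stand-ins: the "state" contracts toward 2.0, where the log-likelihood peaks.
toy_sweep = lambda s: 0.5 * s + 1.0
toy_loglik = lambda s: -(s - 2.0) ** 2 - 10.0
final_state, n_iter = run_gibbs(toy_sweep, toy_loglik, state=10.0, eps=0.01)
print(final_state, n_iter)
```

In the bandit setting the sampler can be started from the assignments of the previous interaction, so that only a few sweeps are typically needed after each new reward.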
With the sufficient statistics of these assignments (i.e., the counts of rewards observed for arm and assigned to mixture ), and the posterior parameter samples , one computes the expected reward for each arm of the bandit as follows:
(15) 
This leads to the proposed nonparametric Gaussian mixture model Thompson sampling for the contextual MAB problem in Algorithm 1.
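One decision step of such a policy can be sketched as follows, for a scalar context and with the mixture assignments held fixed. For each arm and mixture, Gaussian parameters are drawn from a Normal-inverse-Gamma posterior, the per-mixture expected rewards are combined, and the maximizing arm is played. This is an illustrative sketch rather than the paper's algorithm: the function names and toy hyperparameters are hypothetical, and each mixture is weighted simply by its share of the rewards assigned to it, which is an assumption (it ignores the discount and the possibility of a new mixture).

```python
import random


def sample_nig(rng, U, V, alpha, beta):
    """Draw (w, sigma2) from a scalar Normal-inverse-Gamma distribution."""
    sigma2 = 1.0 / rng.gammavariate(alpha, 1.0 / beta)  # inverse-Gamma(alpha, beta)
    w = rng.gauss(U, (sigma2 * V) ** 0.5)
    return w, sigma2


def thompson_step(rng, x, arms):
    """Pick an arm; arms[a] is a list of (count, (U, V, alpha, beta)), one per mixture."""
    expected = []
    for mixtures in arms:
        total = sum(count for count, _ in mixtures)
        mu = 0.0
        for count, hyper in mixtures:
            w, _ = sample_nig(rng, *hyper)
            # ASSUMPTION: mixture weight = share of rewards assigned to it.
            mu += (count / total) * w * x
        expected.append(mu)
    return max(range(len(arms)), key=lambda a: expected[a])


# Toy posteriors: arm 0 has one mixture, arm 1 has two.
arms = [
    [(30, (0.2, 0.01, 20.0, 1.0))],
    [(20, (1.0, 0.01, 15.0, 1.0)), (10, (0.5, 0.02, 8.0, 1.0))],
]
choice = thompson_step(random.Random(0), x=1.0, arms=arms)
print(choice)
```

In a full loop, the reward observed for the played arm would be appended to that arm's data, the Gibbs sampler rerun for that arm only, and its hyperparameters updated before the next context arrives.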
4 Evaluation
In this section, we evaluate the performance of the proposed nonparametric mixture model Thompson sampling technique. First, we validate that the method performs as expected in the simplest case, i.e., when dealing with a contextual linear Gaussian MAB.
We evaluate different parameterizations of two- and three-armed linear contextual bandits, with uniform and uncorrelated 2-dimensional context, i.e., . We provide results for a specific parameterization of these contextual Gaussian bandits in Fig. 2, where we observe the flexibility of nonparametric mixture models in action (similar results were obtained for other bandit parameterizations, see Appendix B).
We show how the proposed method is able to provide as good regret performance as a Thompson Sampling method that is aware of the true underlying reward distribution. That is, the nonparametric Gaussian mixture model is able to accurately fit the mixture to the correct underlying distribution, so that the regret performance of the proposed Thompson sampling is optimal. These results serve as a validation of the quality of the nonparametric mixture model assumption, as the performance loss of the proposed bandit is negligible: the nonparametric Thompson sampling method is as good as the analytical alternative.
Furthermore, note how the Gibbs sampling inference aligns well with the online nature of the bandit, as the inference is recomputed only with the reward observed for the last played arm. Moreover, because of the incremental availability of observations, the Gibbs sampler achieves convergence (as described in Section 3.1) in a few iterations (in our experiments, fewer than 5 steps were usually required to achieve a 1% log-likelihood relative difference between steps). Such a low computational burden is possible because the Gibbs sampler is run, at each interaction with the world, from a good starting point: i.e., the parameter space that describes all but this newly observed reward.
We now study more challenging cases, i.e., those where the underlying reward distributions do not fit into the exponential family assumption. We focus on the following scenarios:
(16) 
(17) 
(18) 
The reward distributions of the contextual bandits in all the above are Gaussian mixtures dependent on a two-dimensional uncorrelated uniform context, i.e., , , . These reward distributions are complex in that they are multimodal and, in Scenario B and Scenario C, unbalanced. The scenarios differ in the amount of mixture overlap and the similarity between arms. Note the complexity of the reward distributions in Scenario B, with a significant overlap between arm rewards and the unbalanced nature of arm 1. Furthermore, Scenario C describes a MAB with different per-arm reward distributions: a linear Gaussian distribution for arm 0, a bimodal Gaussian mixture for arm 1, and an unbalanced Gaussian mixture with three components for arm 2.
Fig. 3 shows the cumulative regret of the proposed nonparametric mixture model Thompson sampling approach in all scenarios. We compare the performance of our method to that of an oracle Thompson sampling approach that knows the true dimensionality of the problem (i.e., the number of underlying mixtures ). Note that this is only possible in a simulated scenario, as knowing the reward complexity of a MAB beforehand is impractical (an alternative would be to run multiple model assumptions in parallel, with a subsequent model selection).
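In a simulated bandit, where the true expected rewards are known, cumulative regret can be computed by accumulating, at each step, the gap between the best arm's expected reward and that of the arm actually played. The snippet below is a minimal sketch assuming this standard definition of regret; the function name and the toy numbers are hypothetical.

```python
from itertools import accumulate


def cumulative_regret(expected_rewards, played):
    """Cumulative regret over time.

    expected_rewards[t][a] -- true expected reward of arm a at step t (given its context)
    played[t]              -- index of the arm played at step t
    """
    instantaneous = [max(mu_t) - mu_t[a] for mu_t, a in zip(expected_rewards, played)]
    return list(accumulate(instantaneous))


# Three steps, two arms: only the first choice is suboptimal.
regret = cumulative_regret(
    expected_rewards=[[0.1, 0.5], [0.4, 0.2], [0.3, 0.3]],
    played=[0, 0, 1],
)
print(regret)
```

A policy that has identified the best arm for each context produces a flat cumulative regret curve from that point on.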
Fig. 3 shows the full power and flexibility of the proposed nonparametric Gaussian mixture model based Thompson sampling. Due to the capacity of Bayesian nonparametrics to autonomously adjust the complexity of the model to the sequentially observed data, the proposed method not only fits the underlying reward function accurately, but also attains reduced regret. This is achieved in the most challenging MAB scenarios (i.e., different per-arm distributions not in the exponential family), and with no parameter tuning ( and have been used in these experiments).
Finally, we evaluate the proposed method in a real application, i.e., the recommendation of personalized news articles, in a similar fashion as done by Chapelle and Li (8). Online content recommendation represents an important example of reinforcement learning, as it requires efficient balancing of the exploration and exploitation tradeoff.
We use the R6A - Yahoo! Front Page Today Module User Click Log Dataset (available at https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=49), which contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on the Yahoo! Front Page during the first ten days of May 2009. The articles to be displayed were originally chosen uniformly at random from a hand-picked pool of high-quality articles. For our evaluation, we picked 2 subsets of 20 articles shown on May 4th and 5th, with a total of 75779 and 77308 logged user interactions, respectively.
The goal is to choose the most interesting article for each user, evaluated by counting the total number of clicks. In the dataset, each user is associated with six features: a bias term and five membership features constructed via conjoint analysis with a bilinear model, as described in (9).
We treat each article as an arm ($|\mathcal{A}| = 20$), and the reward is whether the article is clicked or not by the user ($y_t \in \{0, 1\}$). We pose the problem as a MAB, where we want to maximize the average click-through rate (CTR) on the recommended articles. We implemented both the proposed nonparametric Gaussian mixture model, and the logistic reward model proposed by Chapelle and Li (8), with the importance-sampling-based implementation of Urteaga and Wiggins (43).
Summary CTR results are provided in Table 1, for both evaluated reward bandit models. Observe the flexibility of the nonparametric mixture model, as it is able to attain an overall improved CTR.
Model  |  May 4th: CTR  |  May 4th: Normalized CTR  |  May 5th: CTR  |  May 5th: Normalized CTR
Logistic  |  0.0451 +/- 0.0068  |  1.0855 +/- 0.1794  |  0.0462 +/- 0.0054  |  1.0472 +/- 0.1486
Nonparametric mixture model  |  0.0474 +/- 0.0044  |  1.1413 +/- 0.1381  |  0.0483 +/- 0.0038  |  1.0932 +/- 0.1098
5 Conclusion
With this work, we contribute to the field of reinforcement learning by proposing a nonparametric mixture model based Thompson sampling framework. We merge the advances in the field of Bayesian nonparametrics with a state-of-the-art MAB policy (i.e., Thompson sampling), and allow its extension to complex domains.
The proposed Bayesian algorithm provides interpretable and flexible modeling of convoluted reward functions, balancing the exploration-exploitation tradeoff in complex domains. Empirical results show good cumulative regret performance of the proposed nonparametric Thompson sampling in challenging simulated models, as it adjusts to the complexity of the underlying bandit in an online fashion. With the ability to sequentially learn the nonparametric mixture model that best approximates the true reward distribution, the proposed method attains reduced regret. Our contribution is valuable for practical scenarios, as it avoids stringent model specifications. A future application is to practical scenarios where complex models are likely to outperform simpler parameterized models in describing real data.
5.1 Software and Data
The implementation of the proposed method is available in this public repository. It contains all the software required for replication of the findings of this study.
Acknowledgments
This research was supported in part by NSF grant SCH1344668.
References
 Abbasi-Yadkori et al. (2011) Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved Algorithms for Linear Stochastic Bandits. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2312–2320. Curran Associates, Inc., 2011. URL https://papers.nips.cc/paper/4417improvedalgorithmsforlinearstochasticbandits.
 Agrawal and Goyal (2011) S. Agrawal and N. Goyal. Analysis of Thompson Sampling for the multiarmed bandit problem. CoRR, abs/1111.1797, 2011.
 Agrawal and Goyal (2012a) S. Agrawal and N. Goyal. Thompson Sampling for Contextual Bandits with Linear Payoffs. CoRR, abs/1209.3352, 2012a.
 Agrawal and Goyal (2012b) S. Agrawal and N. Goyal. Further Optimal Regret Bounds for Thompson Sampling. CoRR, abs/1209.3353, 2012b.
 Blundell et al. (2015) C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight Uncertainty in Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 1613–1622. JMLR.org, 2015.
 Burnetas and Katehakis (1997) A. N. Burnetas and M. N. Katehakis. Optimal Adaptive Policies for Markov Decision Processes. Mathematics of Operations Research, 22(1):222–255, 1997. doi: 10.1287/moor.22.1.222.
 Cesa-Bianchi and Kakade (2011) N. Cesa-Bianchi and S. Kakade. An Optimal Algorithm for Linear Bandits. ArXiv e-prints, Oct. 2011.
 Chapelle and Li (2011) O. Chapelle and L. Li. An Empirical Evaluation of Thompson Sampling. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011.
 Chu et al. (2009) W. Chu, S.-T. Park, T. Beaupre, N. Motgi, A. Phadke, S. Chakraborty, and J. Zachariah. A Case Study of Behavior-driven Conjoint Analysis on Yahoo!: Front Page Today Module. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 1097–1104, New York, NY, USA, 2009. ACM. ISBN 9781605584959. doi: 10.1145/1557019.1557138.
 Chu et al. (2011) W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual Bandits with Linear Payoff Functions. In G. Gordon, D. Dunson, and M. Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 208–214, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR. URL http://proceedings.mlr.press/v15/chu11a.html.
 Gershman and Blei (2012) S. J. Gershman and D. M. Blei. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1):1 – 12, 2012. ISSN 00222496. doi: https://doi.org/10.1016/j.jmp.2011.08.004.
 Gopalan and Mannor (2015) A. Gopalan and S. Mannor. Thompson Sampling for Learning Parameterized Markov Decision Processes. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 861–898, Paris, France, 03–06 Jul 2015. PMLR. URL http://proceedings.mlr.press/v40/Gopalan15.html.
 Gosavi (2009) A. Gosavi. Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2):178–192, 2009.
 Grünewälder et al. (2010) S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret Bounds for Gaussian Process Bandit Problems. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 273–280, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/grunewalder10a.html.
 Kingma et al. (2015) D. P. Kingma, T. Salimans, and M. Welling. Variational Dropout and the Local Reparameterization Trick. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2575–2583. Curran Associates, Inc., 2015.
 Korda et al. (2013) N. Korda, E. Kaufmann, and R. Munos. Thompson Sampling for 1-Dimensional Exponential Family Bandits. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1448–1456. Curran Associates, Inc., 2013.
 Krause and Ong (2011) A. Krause and C. S. Ong. Contextual Gaussian Process Bandit Optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2447–2455. Curran Associates, Inc., 2011.
 Langford and Zhang (2008) J. Langford and T. Zhang. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 817–824. Curran Associates, Inc., 2008. URL https://papers.nips.cc/paper/3178theepochgreedyalgorithmformultiarmedbanditswithsideinformation.
 Li et al. (2016) C. Li, C. Chen, D. Carlson, and L. Carin. Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 1788–1794. AAAI Press, 2016.
 Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A Contextual-Bandit Approach to Personalized News Article Recommendation. CoRR, abs/1003.0146, 2010.
 Lipton et al. (2016) Z. C. Lipton, X. Li, J. Gao, L. Li, F. Ahmed, and L. Deng. Efficient Dialogue Policy Learning with BBQ-Networks. ArXiv e-prints, Aug. 2016.
 Maillard et al. (2011) O.-A. Maillard, R. Munos, and G. Stoltz. Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences. In Conference On Learning Theory, 2011.
 Neal (2000) R. M. Neal. Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000. ISSN 10618600.
 Osband et al. (2016) I. Osband, C. Blundell, A. Pritzel, and B. V. Roy. Deep Exploration via Bootstrapped DQN. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4026–4034. Curran Associates, Inc., 2016.
 Ouyang et al. (2017) Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain. Learning Unknown Markov Decision Processes: A Thompson Sampling Approach. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1333–1342. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6732learningunknownmarkovdecisionprocessesathompsonsamplingapproach.pdf.
 Rasmussen and Williams (2005) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
 Riquelme et al. (2018) C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. In International Conference on Learning Representations, 2018.
 Robbins (1952) H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
 Rusmevichientong and Tsitsiklis (2010) P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, May 2010. ISSN 0364-765X. doi: 10.1287/moor.1100.0446.
 Russo and Roy (2014) D. Russo and B. V. Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
 Russo and Roy (2016) D. Russo and B. V. Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.
 Russo et al. (2018) D. J. Russo, B. V. Roy, A. Kazerouni, I. Osband, and Z. Wen. A Tutorial on Thompson Sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018. ISSN 1935-8237. doi: 10.1561/2200000070. URL http://dx.doi.org/10.1561/2200000070.
 Scott (2010) S. L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010. ISSN 1526-4025. doi: 10.1002/asmb.874.
 Scott (2015) S. L. Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31:37–49, 2015. Special issue on actual impact and future perspectives on stochastic modelling in business and industry.
 Snelson and Ghahramani (2006) E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. MIT Press, 2006.
 Srinivas et al. (2010) N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, pages 1015–1022, USA, 2010. Omnipress. ISBN 978-1-60558-907-7.
 Teh and Jordan (2010) Y. W. Teh and M. I. Jordan. Hierarchical Bayesian nonparametric models with applications. Bayesian nonparametrics, 1, 2010.
 Teh et al. (2006) Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006. doi: 10.1198/016214506000000302.
 Thompson (1933) W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3/4):285–294, 1933. ISSN 0006-3444.
 Thompson (1935) W. R. Thompson. On the Theory of Apportionment. American Journal of Mathematics, 57(2):450–456, 1935. ISSN 0002-9327, 1080-6377.
 Titsias (2009) M. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In D. van Dyk and M. Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR.
 Urteaga and Wiggins (2018a) I. Urteaga and C. Wiggins. Variational inference for the multi-armed contextual bandit. In A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 698–706, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.
 Urteaga and Wiggins (2018b) I. Urteaga and C. Wiggins. (Sequential) Importance Sampling Bandits. ArXiv e-prints, Sept. 2018.
Appendix A Nonparametric Hierarchical Mixture models
The generative process of a Pitman-Yor mixture model follows:

.

, for

, that is
(19) 
where refer to the per assignments to local clusters , each with mixture assignment . For parametric measures, we write and , where are the prior hyperparameters of the distribution and , the posterior parameters after observations. Note again that the hierarchical Dirichlet process is a particular case of the above, with all discount parameters set to zero.
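As an illustration of the generative process described above, the following is a minimal sketch of sampling from a two-level Pitman-Yor mixture through its Chinese-restaurant-franchise representation. It is not the paper's implementation. The function and argument names, the context-free Gaussian emissions with fixed standard deviation `noise_std`, and the Gaussian base measure with standard deviation `base_std` are all assumptions made for brevity. Setting both discounts to zero gives the hierarchical Dirichlet process.

```python
import numpy as np


def py_crp_draw(counts, discount, concentration, rng):
    """Pitman-Yor Chinese-restaurant draw.

    Returns the index of an existing cluster, or len(counts) for a new one.
    Existing cluster j has weight (counts[j] - discount); a new cluster has
    weight (concentration + discount * number_of_clusters).
    """
    counts = np.asarray(counts, dtype=float)
    n_clusters = len(counts)
    weights = np.append(counts - discount, concentration + discount * n_clusters)
    return int(rng.choice(n_clusters + 1, p=weights / weights.sum()))


def sample_hierarchical_py(n_arms, n_per_arm, top_discount=0.2, top_concentration=1.0,
                           arm_discount=0.2, arm_concentration=1.0,
                           base_std=3.0, noise_std=0.5, seed=0):
    """Sample rewards for each arm from a two-level Pitman-Yor Gaussian mixture."""
    rng = np.random.default_rng(seed)
    global_means = []    # mixture parameters, drawn from the (assumed) Gaussian base measure
    global_counts = []   # number of local clusters attached to each global mixture
    rewards, mixture_ids = [], []
    for _ in range(n_arms):
        local_counts = []      # observations in each local cluster of this arm
        local_to_global = []   # mixture assignment of each local cluster
        y_arm, k_arm = [], []
        for _ in range(n_per_arm):
            l = py_crp_draw(local_counts, arm_discount, arm_concentration, rng)
            if l == len(local_counts):
                # New local cluster: choose its mixture from the shared top level.
                k = py_crp_draw(global_counts, top_discount, top_concentration, rng)
                if k == len(global_counts):
                    global_means.append(rng.normal(0.0, base_std))
                    global_counts.append(0)
                global_counts[k] += 1
                local_counts.append(0)
                local_to_global.append(k)
            local_counts[l] += 1
            k = local_to_global[l]
            y_arm.append(rng.normal(global_means[k], noise_std))
            k_arm.append(k)
        rewards.append(y_arm)
        mixture_ids.append(k_arm)
    return np.array(rewards), np.array(mixture_ids), np.array(global_means)
```

Because the arms draw their mixtures from a shared top level, the same global mixture can appear in several arms, which is what lets the hierarchical model share statistical strength across arms.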
The Gibbs sampler for inference of the above model after observations relies on the conditional distribution of observation assignments to local clusters ,
(20) 
and mixture assignments for a local cluster:
(21) 
(22) 
where with we refer to all but those assigned to local cluster in set .
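The local-cluster assignment step of such a Gibbs sampler can be sketched as follows. This is a hedged illustration using the standard Pitman-Yor Chinese-restaurant weights, not a transcription of the equations above. The function name and arguments are hypothetical, and the predictive log-likelihoods are supplied by the caller. In a hierarchical model, `new_loglik` would itself be a marginal over the global mixtures.

```python
import numpy as np


def local_assignment_probs(local_counts, local_loglik, new_loglik, discount, concentration):
    """Conditional probabilities of assigning one observation to each existing
    local cluster, or to a new local cluster (last entry).

    local_counts: cluster sizes with the observation being resampled removed
                  (empty clusters must be dropped beforehand).
    local_loglik: log predictive likelihood of the observation under each cluster.
    new_loglik:   log predictive likelihood of the observation under a new cluster.
    """
    local_counts = np.asarray(local_counts, dtype=float)
    n_local = len(local_counts)
    # Pitman-Yor prior weights: (n_l - d) for existing clusters, (gamma + d * L) for a new one.
    log_prior = np.log(np.append(local_counts - discount,
                                 concentration + discount * n_local))
    log_post = log_prior + np.append(np.asarray(local_loglik, dtype=float), new_loglik)
    log_post -= log_post.max()   # stabilize before exponentiating
    probs = np.exp(log_post)
    return probs / probs.sum()
```

With equal likelihoods the result reduces to the prior weights: for example, cluster sizes `[3, 1]`, discount 0.5 and concentration 1 give weights proportional to 2.5, 0.5 and 2.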
The alternative hierarchical nonparametric mixture model MAB is illustrated in Fig. 4.