Bayesian Unification of Gradient and Bandit-based Learning for Accelerated Global Optimisation
Abstract
Bandit based optimisation schemes have a remarkable advantage over gradient based approaches due to their global perspective, which eliminates the danger of getting stuck at local optima. However, for continuous optimisation problems or problems with a large number of actions, bandit based approaches can be hindered by slow learning. Gradient based approaches, on the other hand, navigate quickly in high-dimensional continuous spaces through local optimisation, following the gradient in fine grained steps. However, apart from being susceptible to local optima, these schemes are also less suited for online learning due to their reliance on extensive trial-and-error before the optimum can be identified. In contrast, bandit algorithms seek to identify the optimal action (the global optimum) in as few steps as possible. In this paper, we propose a Bayesian approach that unifies the above two distinct paradigms in one single framework, with the aim of combining their advantages. At the heart of our approach we find a stochastic linear approximation of the function to be optimised, where both the gradient and values of the function are explicitly captured. This model allows us to learn from both noisy function and gradient observations, as well as to predict these properties across the action space to support optimisation. We further propose an accompanying bandit driven exploration scheme that uses Bayesian credible bounds to trade off exploration against exploitation. Our empirical results demonstrate that by unifying bandit and gradient based learning, one obtains consistently improved performance across a wide spectrum of problem environments. Furthermore, even when gradient feedback is unavailable, the flexibility of our model, including gradient prediction, still allows us to outperform competing approaches, although with a smaller margin.
Due to the pervasiveness of bandit based optimisation, our scheme opens the door to improved performance both in meta-optimisation and in applications where gradient related information is readily available.
1 Introduction
1.1 Background and Motivation
The multi-armed bandit problem is a classical optimisation problem that captures the trade-off between exploitation and exploration in reinforcement learning. The problem consists of an agent that sequentially pulls one out of multiple arms attached to a gambling machine, with each pull resulting in a scalar reward. Each reward is randomly drawn from an unknown distribution, unique to each arm. The aim is to identify the arm with the highest expected reward as quickly as possible, through goal-directed trial-and-error.
Bandit based optimisation schemes have a tremendous advantage over gradient based approaches (such as [1]) due to their global perspective, which eliminates the danger of getting stuck at local optima. However, for continuous optimisation problems or problems with a large number of arms (actions), bandit based approaches are hindered by their inability to generalise across arms (typically modelling arms as independent reward sources). This independence assumption leads to slow learning because the expected reward must be inferred independently for each arm. Gradient based approaches, on the other hand, navigate quickly in high-dimensional continuous spaces through local optimisation, following the gradient in small steps. The local optimisation, however, makes this class of schemes susceptible to local optima. Further, they are less suited for online learning due to their reliance on extensive trial and error, with small parameter adjustments at each step. In contrast, bandit algorithms are designed for online operation, aiming to converge to the optimal arm (the global optimum) in as few trials as possible.
To deal with continuous and large action spaces, several bandit based approaches have recently been proposed that capture interaction among actions. One class of schemes, referred to as global multi-armed bandit schemes, models the expected rewards of the arms as (non)linear functions of a global parameter [2]. Another family of techniques attacks large action spaces through tree based searching, with X-Armed Bandits finding global maxima when the expected reward (objective) function is “locally Lipschitz” [3]. Finally, Gaussian processes have been applied for smoothing and interpolation, forming the foundation for bandit based exploration and exploitation in continuous action spaces [4].
Gaussian process based approaches are particularly attractive because they provide a Bayesian estimate of the expected reward functions including credible intervals, as illustrated in Fig. 1.
In brief, dedicated kernel functions capture smoothness and other function dynamics, encoded in a covariance matrix. However, as illustrated in the figure, the scheme is “blind” towards the gradient of the underlying reward function (red line), merely “tracing” a line through the observations (crosses) and, in the absence of other input, tending towards the prior mean (typically set to zero).
In conclusion, gradient and bandit based schemes have distinct advantages and disadvantages.
1.2 Paper Contributions and Outline
In this paper we propose a radically new approach to stochastic optimisation where the global perspective of multi-armed bandits is combined with gradient based local optimisation, with the effect of significantly accelerating learning. In brief, our approach provides a Bayesian Unification of Gradient and Bandit-based learning (hereafter referred to as BUGB). Our contributions can be summarised as follows:

At the heart of BUGB we find a novel Bayesian model that explicitly connects the expected reward function with its gradient. The model supports learning from both noisy function values and gradient related observations. Further, unobserved function values and gradient information can be predicted across the action space to support goal-directed exploration and exploitation of the reward function.

Our empirical results demonstrate that by unifying bandit and gradient based learning, one obtains improved performance across a wide spectrum of reward functions and degrees of noise.

Even when gradient feedback is unavailable, the flexibility of our model, including gradient estimation, allows us to still outperform competing approaches, although with a smaller margin.
In Section 2 we provide the details of our Bayesian approach to unifying gradient and bandit-based learning. We introduce a grid based linear approximation of the reward function that explicitly relates function and gradient values, modelled as a set of stochastic variables to address noisy observations and relationships. We then cover accompanying optimisation strategies based on Bayesian credibility bounds as well as Thompson Sampling. In Section 3 we provide empirical results demonstrating the superiority of our scheme in a wide range of settings. We conclude the paper in Section 4 by providing pointers for further research.
2 Bayesian Unification of Gradient and Bandit-based Learning (BUGB)
2.1 The BUGB Model
The BUGB model is based on a linear approximation, $\hat{f}(x)$, of the expected reward function, $f(x)$, using a grid of input values, $x_0, x_1, \ldots, x_n$. This paper focuses on one-dimensional cases. The approximation then takes the following recursive form:
$$\hat{f}(x_i) = \hat{f}(x_{i-1}) + f'(x_{i-1}) \, (x_i - x_{i-1})$$
for $i = 1, \ldots, n$.
As illustrated in Fig. 2, for any input value $x_i$, the function value, $\hat{f}(x_i)$, is formulated in terms of the gradient, $f'(x_{i-1})$, and the function value, $\hat{f}(x_{i-1})$, of the preceding point, $x_{i-1}$. Note that for $i = 0$, $\hat{f}(x_0)$ then becomes a constant. In brief, the gradient and function values are related in a manner that allows the underlying function to be approximated with arbitrary accuracy. That is, the approximation can be made arbitrarily accurate by making the grid increasingly fine grained: $\hat{f}(x) \to f(x)$ as $\max_i (x_i - x_{i-1}) \to 0$ (for any differentiable function $f$).
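The recursion can be illustrated numerically. The following Python sketch is not taken from the paper: the helper names (`grid_approximation`, `max_error`), the test function sin(x) on [0, π], the evenly spaced grid, and the use of exact gradients are all assumptions chosen purely for illustration. It builds the approximation from a starting value and the gradient at each preceding grid point, and compares a coarse grid with a fine one.

```python
import math

def grid_approximation(f0, grad, a, b, n):
    """Approximate a function on [a, b] with n evenly spaced steps.

    Each value is the previous value plus the gradient at the previous
    grid point times the grid spacing (hypothetical helper).
    """
    step = (b - a) / n
    xs = [a + i * step for i in range(n + 1)]
    values = [f0]  # the first grid point has no predecessor: a constant
    for i in range(1, n + 1):
        values.append(values[-1] + grad(xs[i - 1]) * (xs[i] - xs[i - 1]))
    return xs, values

def max_error(n):
    """Largest absolute deviation from sin(x) on [0, pi] for n steps."""
    xs, values = grid_approximation(math.sin(0.0), math.cos, 0.0, math.pi, n)
    return max(abs(v - math.sin(x)) for x, v in zip(xs, values))

coarse_error = max_error(10)    # grid spacing of about 0.31
fine_error = max_error(1000)    # grid spacing of about 0.003
```

Under these assumptions, the maximum error shrinks roughly in proportion to the grid spacing, which is the behaviour the text relies on when it refers to making the grid increasingly fine grained.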
We are now ready to present our novel Bayesian scheme that explicitly connects the expected reward function with its gradient.
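As shown in Fig. 3, we model each function value, $f(x_i)$, and gradient value, $f'(x_i)$, as stochastic variables $F_i$ and $G_i$, respectively. These stochastic variables are normally distributed, with corresponding unknown means, $\mu_{F_i}$ and $\mu_{G_i}$, and variances, $\sigma^2_{F_i}$ and $\sigma^2_{G_i}$. As further seen in the figure, the relationships between the variables are defined recursively, according to the aforementioned linear approximation scheme: $F_i = F_{i-1} + G_{i-1} \, (x_i - x_{i-1}) + \epsilon_i$. Here, $\epsilon_i$ captures uncertainty, representing i.i.d. Gaussian noise, $\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2)$. Furthermore, we model the dynamics of the unknown gradient of $f$ by relating neighbouring stochastic gradient variables: $G_i = G_{i-1} + \delta_i$. That is, the change rate is stochastically governed by i.i.d. Gaussian noise, $\delta_i \sim \mathcal{N}(0, \sigma_\delta^2)$.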
Using a factor graph based computation approach for the above model, we can efficiently calculate the posterior joint and marginal distributions for all the variables, given noisy information on both function and gradient values (the computational complexity grows linearly with the number of grid points).
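To make the linear-time inference concrete, the sketch below treats the model as a chain-structured linear-Gaussian system with one state per grid point, consisting of a function value and a gradient value. For such a chain, a forward Kalman filter followed by a backward Rauch-Tung-Striebel smoothing pass yields the same posterior marginals as message passing on the corresponding factor graph. This is an illustration under stated assumptions, not the paper's implementation: the zero-mean prior, all noise variances (`q_f`, `q_g`, `r_f`, `r_g`, `prior_var`), the evenly spaced grid, and every function name are hypothetical.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_add(a, b, sign=1.0):
    return [[a[i][j] + sign * b[i][j] for j in range(2)] for i in range(2)]

def mat_vec(a, v):
    return [a[i][0] * v[0] + a[i][1] * v[1] for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def inverse(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def observe(mean, cov, k, y, noise_var):
    """Condition the state on a noisy scalar observation y of component k
    (k = 0: function value, k = 1: gradient)."""
    s = cov[k][k] + noise_var
    gain = [cov[0][k] / s, cov[1][k] / s]
    new_mean = [mean[j] + gain[j] * (y - mean[k]) for j in range(2)]
    new_cov = [[cov[i][j] - gain[i] * cov[k][j] for j in range(2)]
               for i in range(2)]
    return new_mean, new_cov

def posterior(n, step, f_obs, g_obs, q_f=1e-4, q_g=1.0,
              r_f=0.01, r_g=0.01, prior_var=100.0):
    """Return a list of (mean, covariance) pairs, one per grid point.

    f_obs and g_obs map a grid index to a list of noisy observations.
    All default variances are assumed values for illustration only.
    """
    A = [[1.0, step], [0.0, 1.0]]   # value advances by gradient * step
    At = transpose(A)
    Q = [[q_f, 0.0], [0.0, q_g]]    # transition noise for value and gradient
    mean, cov = [0.0, 0.0], [[prior_var, 0.0], [0.0, prior_var]]
    predicted, filtered = [], []
    for i in range(n):              # forward pass over the grid
        if i > 0:
            mean = mat_vec(A, mean)
            cov = mat_add(mat_mul(mat_mul(A, cov), At), Q)
        predicted.append((mean, cov))
        for y in f_obs.get(i, []):
            mean, cov = observe(mean, cov, 0, y, r_f)
        for y in g_obs.get(i, []):
            mean, cov = observe(mean, cov, 1, y, r_g)
        filtered.append((mean, cov))
    smoothed = [None] * n
    smoothed[-1] = filtered[-1]
    for i in range(n - 2, -1, -1):  # backward pass over the grid
        m_f, p_f = filtered[i]
        m_p, p_p = predicted[i + 1]
        m_s, p_s = smoothed[i + 1]
        c = mat_mul(mat_mul(p_f, At), inverse(p_p))
        shift = mat_vec(c, [m_s[0] - m_p[0], m_s[1] - m_p[1]])
        new_mean = [m_f[0] + shift[0], m_f[1] + shift[1]]
        new_cov = mat_add(
            p_f, mat_mul(mat_mul(c, mat_add(p_s, p_p, -1.0)), transpose(c)))
        smoothed[i] = (new_mean, new_cov)
    return smoothed

# Hypothetical data consistent with f(x) = 2x on an 11-point grid over [0, 1]:
# function and gradient observations at both end points only.
post = posterior(11, 0.1, f_obs={0: [0.0], 10: [2.0]},
                 g_obs={0: [2.0], 10: [2.0]})
mid_mean, mid_cov = post[5]
```

Each pass visits every grid point once and performs a constant amount of 2×2 arithmetic per point, so the cost grows linearly with the number of grid points. In this toy setting, the gradient observations at the end points let the posterior mean at the unobserved midpoint land close to the true value, while its variance remains larger than at the observed points.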
2.2 Optimization Strategies with Thompson Sampling and Upper Confidence Intervals
From a rather broad perspective, there are currently two competing families of strategies for finding the global optimum in the bandit setting: Thompson sampling (stochastic probability matching) and schemes based on upper confidence (or credibility) bounds (UCB). Thompson sampling tends to provide better performance than UCB-based approaches in empirical investigations, but is known to over-explore. UCB-like approaches, on the other hand, provide a deterministic and more goal-directed path towards the global optimum, finding the optimum with probability arbitrarily close to unity, whereas Thompson sampling always converges to the global optimum (with unit probability) [5].
In [5] we proposed a Bayesian technique for solving bandit-like problems, akin to the Thompson Sampling [8] principle, leading to novel schemes for handling multi-armed and non-stationary (restless) bandit problems [9, 10]. Empirical results demonstrated the advantages of these techniques over established top performers. Furthermore, we provided theoretical results stating that the original technique is instantaneously self-correcting and that it converges to only pulling the optimal arm with probability as close to unity as desired. Later on, as a further testimony to the renewed importance of the Thompson Sampling principle, a modern Bayesian look at the multi-armed bandit problem was also taken in [6, 7].
A promising avenue for solving the multi-armed bandit problem involves methods based on confidence intervals: the scheme estimates a confidence interval for the reward probability of each arm, and an “optimistic reward probability estimate” is identified for each arm. The arm with the most optimistic reward probability estimate is then greedily selected [11, 12].
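As a point of reference for the Thompson Sampling principle, the following Python sketch shows the standard Beta-Bernoulli form of the scheme. It is a generic illustration, not the specific algorithm of [5]: the uniform Beta(1, 1) priors, the reward probabilities, the seed, and the function name `thompson_sampling` are assumptions.

```python
import random

def thompson_sampling(reward_probs, pulls, seed=0):
    """Pull Bernoulli arms by sampling from per-arm Beta posteriors.

    Returns how often each arm was pulled (hypothetical helper).
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    wins = [1] * n_arms     # Beta(1, 1) prior: one pseudo-reward per arm
    losses = [1] * n_arms   # and one pseudo-penalty per arm
    counts = [0] * n_arms
    for _ in range(pulls):
        # Draw one plausible reward probability per arm from its posterior.
        samples = [rng.betavariate(wins[a], losses[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulate the environment and update the chosen arm's posterior.
        if rng.random() < reward_probs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        counts[arm] += 1
    return counts

counts = thompson_sampling([0.2, 0.8], pulls=2000)
```

Because each arm is chosen with the posterior probability that it is the best one, exploration of the inferior arm fades as evidence accumulates but never stops entirely.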
In [13], the authors analysed several confidence interval based algorithms. These algorithms also provide logarithmically increasing regret, with UCB-Tuned, a variant of the well-known UCB1 algorithm, outperforming UCB1, UCB2, and the greedy strategy. In brief, in UCB-Tuned, the following optimistic estimates are used for each arm $i$:
$$\hat{\mu}_i + \sqrt{\frac{\ln n}{n_i} \min\left\{\frac{1}{4},\; \hat{\sigma}_i^2 + \sqrt{\frac{2 \ln n}{n_i}}\right\}} \qquad (1)$$
where $\hat{\mu}_i$ and $\hat{\sigma}_i^2$ are the sample mean and variance of the rewards that have been obtained from arm $i$, $n$ is the total number of arm pulls, and $n_i$ is the number of times arm $i$ has been pulled. Thus, the quantity added to the sample average of a specific arm is steadily reduced as the arm is pulled, and the corresponding uncertainty about the reward probability is reduced. As a result, by always selecting the arm with the highest optimistic reward estimate, UCB-Tuned gradually shifts from exploration to exploitation.
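The optimistic estimate can be sketched in a few lines of Python. The index below follows the UCB-Tuned rule as published in [13], computed from an arm's sample mean and sample variance, the total number of pulls, and the number of pulls of that arm. The function names and the example numbers are hypothetical.

```python
import math

def ucb_tuned_index(sample_mean, sample_var, total_pulls, arm_pulls):
    """Optimistic reward estimate for one arm under UCB-Tuned."""
    log_term = math.log(total_pulls) / arm_pulls
    # Upper bound on the arm's variance, capped at 1/4, which is the
    # largest variance a reward confined to [0, 1] can have.
    variance_bound = sample_var + math.sqrt(2.0 * log_term)
    return sample_mean + math.sqrt(log_term * min(0.25, variance_bound))

def select_arm(means, variances, pulls):
    """Greedily pick the arm with the highest optimistic estimate."""
    total = sum(pulls)
    scores = [ucb_tuned_index(m, v, total, k)
              for m, v, k in zip(means, variances, pulls)]
    return max(range(len(scores)), key=lambda i: scores[i])

# Two arms with identical statistics: the less explored one is preferred.
chosen = select_arm([0.5, 0.5], [0.1, 0.1], [5, 50])
```

With equal sample means, the arm that has been pulled less often receives the larger exploration bonus, which is the mechanism behind the gradual shift from exploration to exploitation.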
By providing a Bayesian estimate of the function to be optimised, the BUGB model supports both Thompson Sampling and UCB-based optimisation. However, as further explored below, we obtained the best performance by calculating 95% Bayesian credible bounds across the grid of input values. By iteratively measuring the function value at the input with the highest upper bound and then updating our estimate of the function using BUGB, we were able to quickly converge to the maximum of the function.
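The selection step can be sketched as follows, assuming that the posterior at each grid point is summarised by a normal mean and variance. The function name `next_query_index` and the example numbers are hypothetical, and reading the 95% bound as the upper end of a two-sided 95% interval (z ≈ 1.96) is an assumption, since the text does not say whether the bound is one-sided or two-sided.

```python
from statistics import NormalDist

def next_query_index(means, variances, credibility=0.95):
    """Return the grid index with the highest upper credible bound,
    together with all upper bounds."""
    # Quantile of the standard normal for a two-sided interval (assumed).
    z = NormalDist().inv_cdf(0.5 + credibility / 2.0)
    upper = [m + z * v ** 0.5 for m, v in zip(means, variances)]
    best = max(range(len(upper)), key=lambda i: upper[i])
    return best, upper

# A highly uncertain point can outrank a point with a higher posterior mean.
index, bounds = next_query_index([0.0, 1.0, 0.5], [0.01, 0.01, 4.0])
```

Points that are either promising or poorly understood are measured first. As observations accumulate and the variances shrink, the choice is increasingly driven by the posterior mean.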
3 Empirical Results
In this section we evaluate the BUGB scheme by comparing it with the currently best performing approaches. Based on our comparison with these “reference” algorithms, it should be quite straightforward to also relate the BUGB performance results to the performance of other similar algorithms.
3.1 Experimental Setup
We have conducted numerous experiments using various functions, generating artificial data, under varying degrees of observation noise. The full range of empirical results all show the same trend. However, we here report performance on a representative subset of the experiment configurations, involving unimodal and multimodal functions, with varying degrees of noise and resolutions. Performance is measured in terms of regret: the difference between the sum of rewards expected after successive rounds and what would have been obtained by always selecting the optimal point.
For these experiment configurations, an ensemble of independent replications with different random number streams was performed to minimise the variance of the reported results. In order to investigate the performance of the schemes under a broad spectrum of environments, we test the schemes using three different representative functions: one sloped, with a single maximum, and one more peaked, with multiple local maxima that are particularly similar to the global maximum. To investigate performance under varying degrees of noise we introduced i.i.d. Gaussian observation noise, employing a diverse range of noise levels, from 0.01 up to 5.0 (see Table 2). Regret is reported after 25, 50, 100, and 250 iterations for both the new accelerating scheme and the traditional static scheme.
3.2 Comparison of Regret
The regret measure is non-trivial, and so we provide further clarification here. In brief, the regret can be seen as the difference between the sum of rewards expected after successive arm pulls, and what would have been obtained by only pulling the optimal arm. To exemplify, assume that a reward yields a value (utility) of $1$, and that a penalty is associated with the value $0$. This implies that the expected utility of pulling arm $i$ is $r_i$. Thus, if the optimal arm is arm $1$, the regret after $N$ plays would become:
$$r_1 N - \sum_{n=1}^{N} \hat{r}_n \qquad (2)$$
with $\hat{r}_n$ being the expected reward at arm pull $n$, given the agent’s arm selection strategy.
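A small Python sketch makes the bookkeeping explicit. It computes the regret of a single run from the arms that were actually chosen, which is a simplification of the definition above, where the expectation is also taken over the agent's selection strategy. The function name and the example values are hypothetical.

```python
def cumulative_regret(expected_rewards, chosen_arms):
    """Running regret: the optimal arm's expected reward times the number
    of pulls so far, minus the summed expected rewards of the arms chosen."""
    best = max(expected_rewards)
    regret, trace = 0.0, []
    for arm in chosen_arms:
        regret += best - expected_rewards[arm]
        trace.append(regret)
    return trace

# Two arms with expected rewards 0.9 and 0.5: the optimal arm is pulled
# first and last, the inferior arm twice in between.
trace = cumulative_regret([0.9, 0.5], [0, 1, 1, 0])
```

Each pull of the inferior arm adds the gap between the two expected rewards, while pulls of the optimal arm leave the regret unchanged, so a scheme that stops pulling inferior arms has a regret curve that flattens.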
Table 1 reports average regret over multiple functions for different numbers of time steps. As exemplified in Fig. 4, pursuing a UCB strategy, BUGB only needs 45 observations to capture the underlying function, allowing it to quickly zoom in on the most promising input value regions. The effect of this is seen in the table, with BUGB performing significantly better than the competing state-of-the-art schemes.
Also notice how BUGB outperforms the Gaussian process based UCB approach, even when not receiving feedback on the gradient function. This could be explained by the ability of BUGB to infer gradient information indirectly by means of the noisy function value observations.
Algorithm / Time steps         25     50    100    250

BUGB w/o gradient            20.0   26.4   34.2   50.9
BUGB                         13.3   16.7   21.8   31.5
Multi-armed Bandit w/ UCB    26.0   35.9   48.5   71.7
Gaussian Process w/ UCB      19.9   27.1   36.3   53.9
Gradient Descent             23.0   43.6   79.2  171.9
Uniform                      38.1   76.4  152.8  381.1
The above findings are confirmed by the plots in Fig. 5, showing that BUGB provides superior performance at every time step. The Gaussian process based approach is better than BUGB without gradient feedback up to time step ten or so, after which BUGB w/o gradients is slightly better for the remainder of the time steps.
Table 2 summarises performance under a diverse range of noise levels, from 0.01 up to 5.0. BUGB is consistently the superior approach across all the noise levels, both with and without feedback on the gradient.
Algorithm / Noise            0.01    0.1    1.0     5.0

BUGB w/o gradient            12.8   17.8   50.9  110.57
BUGB                         11.5   13.6   31.5    55.6
Multi-armed Bandit w/ UCB    31.6   34.9   71.7   149.2
Gaussian Process w/ UCB      14.4   19.6   53.9   175.6
Gradient Descent            212.3  201.7  171.9   249.6
Uniform                     382.1  381.7  381.1   380.5
Interestingly, the performance of gradient descent improves at the mid-range noise levels. This can be explained by the increased noise making it possible to escape local optima. However, performance degrades again at the highest noise level.
Table 3 summarises computational performance. In brief, the structure of the BUGB model lends itself to efficient computation, because inference can be carried out through local computation. This leads to a linear increase in computation time with respect to the number of observations, as opposed to the much more computationally expensive Gaussian process based approach (with exact computation involving covariance matrix inversion).
BUGB   MAB    Gaussian Process   Gradient Descent

1.74   0.58   789.7              0.04
For all of these experiments, the gradients of the functions were precalculated, making gradient descent computationally extremely efficient.
4 Conclusions and Further Work
In this paper we have proposed a novel approach to global optimisation where bandit based and gradient based learning are combined. Our Bayesian model, BUGB, unifies the two paradigms in one integrated model. At the heart of the model we find a stochastic linear approximation of the function to be optimised. Here, both the gradient and function values are explicitly related. This allows us to learn from both noisy function and gradient observations, as well as to predict these properties across the action space to support optimisation.
We further proposed an accompanying bandit driven exploration scheme that uses Bayesian credibility bounds to trade off exploration against exploitation. Our empirical results demonstrated that by unifying bandit and gradient based learning, one obtains consistently improved performance across a wide spectrum of environments. Furthermore, even when gradient feedback is unavailable, the flexibility of our model, including gradient prediction, allows us to still outperform competing approaches, although with a smaller margin. Due to the pervasiveness of bandit based optimisation, our scheme opens the door to improved performance both in meta-optimisation and in applications where gradient information is readily available.
In future work, we propose that these results are expanded in a number of directions. First of all, BUGB needs to be generalised to cover multidimensional functions. Additionally, formal regret bounds and asymptotic properties need to be established. Finally, it would be interesting to investigate how BUGB can be leveraged in novel application areas, such as meta-learning in neural networks.
References
[1] Y. Bengio, “Gradient-Based Optimization of Hyperparameters,” Neural Computation, vol. 12, pp. 1889–1900, 2000.
[2] O. Atan, C. Tekin, and M. v.d. Schaar, “Global Multi-armed Bandits with Hölder Continuity,” in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015, pp. 28–36.
[3] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvari, “X-Armed Bandits,” Journal of Machine Learning Research, vol. 12, pp. 1655–1695, 2011.
[4] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, “Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting,” IEEE Transactions on Information Theory, vol. 58, pp. 3250–3265, 2012.
[5] O.-C. Granmo, “Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton,” International Journal of Intelligent Computing and Cybernetics, vol. 3, no. 2, pp. 207–234, 2010.
[6] S. L. Scott, “A modern Bayesian look at the multi-armed bandit,” Applied Stochastic Models in Business and Industry, vol. 26, pp. 639–658, 2010.
[7] B. C. May, N. Korda, A. Lee, and D. S. Leslie, “Optimistic Bayesian sampling in contextual-bandit problems,” Submitted to the Annals of Applied Probability, 2011.
[8] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, pp. 285–294, 1933.
[9] T. Norheim, T. Brådland, O.-C. Granmo, and B. J. Oommen, “A Generic Solution to Multi-Armed Bernoulli Bandit Problems Based on Random Sampling from Sibling Conjugate Priors,” in Proceedings of the Second International Conference on Agents and Artificial Intelligence (ICAART 2010). INSTICC, 2010, pp. 36–44.
[10] O.-C. Granmo and S. Berg, “Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters,” in Proceedings of the Twenty Third International Conference on Industrial, Engineering, and Other Applications of Applied Intelligent Systems (IEA-AIE 2010). Springer, 2010, pp. 199–208.
[11] J. Vermorel and M. Mohri, “Multi-armed bandit algorithms and empirical evaluation,” in Proceedings of ECML 2005. Springer, 2005, pp. 437–448.
[12] L. P. Kaelbling, “Learning in embedded systems,” Ph.D. dissertation, Stanford University, 1993.
[13] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time Analysis of the Multiarmed Bandit Problem,” Machine Learning, vol. 47, pp. 235–256, 2002.