Bayesian Unification of Gradient and Bandit-based Learning for Accelerated Global Optimisation
Bandit based optimisation schemes have a remarkable advantage over gradient based approaches due to their global perspective, which eliminates the danger of getting stuck at local optima. However, for continuous optimisation problems or problems with a large number of actions, bandit based approaches can be hindered by slow learning. Gradient based approaches, on the other hand, navigate quickly in high-dimensional continuous spaces through local optimisation, following the gradient in fine grained steps. However, apart from being susceptible to local optima, these schemes are also less suited for online learning due to their reliance on extensive trial-and-error before the optimum can be identified. In contrast, bandit algorithms seek to identify the optimal action (the global optimum) in as few steps as possible. In this paper, we propose a Bayesian approach that unifies the above two distinct paradigms in one single framework, with the aim of combining their advantages. At the heart of our approach we find a stochastic linear approximation of the function to be optimised, where both the gradient and values of the function are explicitly captured. This model allows us to learn from both noisy function and gradient observations, as well as to predict these properties across the action space to support optimisation. We further propose an accompanying bandit driven exploration scheme that uses Bayesian credible bounds to trade off exploration against exploitation. Our empirical results demonstrate that by unifying bandit and gradient based learning, one obtains consistently improved performance across a wide spectrum of problem environments. Furthermore, even when gradient feedback is unavailable, the flexibility of our model, including gradient prediction, still allows us to outperform competing approaches, although with a smaller margin.
Due to the pervasiveness of bandit based optimisation, our scheme opens up for improved performance both in meta-optimisation and in applications where gradient related information is readily available.
1.1 Background and Motivation
The multi-armed bandit problem is a classical optimisation problem that captures the trade-off between exploitation and exploration in reinforcement learning. The problem consists of an agent that sequentially pulls one out of multiple arms attached to a gambling machine, with each pull resulting in a scalar reward. Each reward is randomly drawn from an unknown distribution, unique to each arm. The purpose is to identify the arm with the highest expected reward as quickly as possible, through goal directed trial-and-error.
Bandit based optimisation schemes have a tremendous advantage over gradient based approaches (such as [1]) due to their global perspective, which eliminates the danger of getting stuck at local optima. However, for continuous optimisation problems or problems with a large number of arms (actions), bandit based approaches are hindered by their inability to generalise across arms (typically modelling arms as independent reward sources). This independence assumption leads to slow learning because the expected reward function must be inferred independently for each arm. Gradient based approaches, on the other hand, navigate quickly in high-dimensional continuous spaces through local optimisation, following the gradient in small steps. The local optimisation, however, makes this class of schemes susceptible to local optima. Further, they are less suited for on-line learning due to their reliance on extensive trial and error, with small parameter adjustments at each step. In contrast, bandit algorithms are designed for on-line operation, aiming to converge to the optimal arm (the global optimum) in as few trials as possible.
To deal with continuous and large action spaces, several bandit based approaches have recently been proposed that capture interaction among actions. One class of schemes, referred to as global multi-armed bandit schemes, models the expected rewards of the arms as (non-)linear functions of a global parameter [2]. Another family of techniques attacks large action spaces through tree based searching, with X-Armed Bandits finding global maxima when the expected reward (objective) function is “locally Lipschitz” [3]. Finally, Gaussian processes have been applied for smoothing and interpolation, forming the foundation for bandit based exploration and exploitation in continuous action spaces [4].
Gaussian process based approaches are particularly attractive because they provide a Bayesian estimate of the expected reward functions including credible intervals, as illustrated in Fig. 1.
In brief, dedicated kernel functions capture smoothness and other function dynamics, encoded in a covariance matrix. However, as illustrated in the figure, the scheme is “blind” towards the gradient of the underlying reward function (red line), merely “tracing” a line through the observations (crosses), tending towards a prior mean without other input (typically set to zero).
In conclusion, gradient and bandit based schemes have distinct advantages and disadvantages.
1.2 Paper Contributions and Outline
In this paper we propose a radically new approach to stochastic optimisation where the global perspective of multi-armed bandits is combined with gradient based local optimisation, with the effect of significantly accelerating learning. In brief, our approach provides a Bayesian Unification of Gradient and Bandit-based learning (hereafter referred to as BUG-B). Our contributions can be summarised as follows:
- At the heart of BUG-B we find a novel Bayesian model that explicitly connects the expected reward function with its gradient. The model supports learning from both noisy function values and gradient related observations. Further, unobserved function values and gradient information can be predicted across the action space to support goal-directed exploration and exploitation of the reward function.
- Our empirical results demonstrate that by unifying bandit and gradient based learning, one obtains improved performance across a wide spectrum of reward functions and degrees of noise.
- Even when gradient feedback is unavailable, the flexibility of our model, including gradient estimation, allows us to still outperform competing approaches, although with a smaller margin.
In Section 2 we provide the details of our Bayesian approach to unifying gradient and bandit-based learning. We introduce a grid based linear approximation of the reward function that explicitly relates function and gradient values, modelled as a set of stochastic variables to address noisy observations and relationships. We then cover accompanying optimisation strategies based on Bayesian credibility bounds as well as Thompson Sampling. In Section 3 we provide empirical results demonstrating the superiority of our scheme in a wide range of settings. We conclude the paper in Section 4 by providing pointers for further research.
2 Bayesian Unification of Gradient and Bandit-based Learning (BUG-B)
2.1 The BUG-B Model
The BUG-B model is based on a linear approximation of the expected reward function over a grid of input values. This paper focuses on one-dimensional cases. The approximation then takes a recursive form.
As illustrated in Fig. 2, for any grid point, the function value is formulated in terms of the gradient and the function value at the preceding grid point. Note that for the first grid point, the function value becomes a constant. In brief, the gradient and function values are related in a manner that allows the underlying function to be approximated with arbitrary accuracy: for any differentiable function, the approximation can be made arbitrarily accurate by making the grid increasingly fine grained.
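The recursion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `linear_approximation`, `make_grid`, and `max_error` are hypothetical, the update `f[i] = f[i-1] + gradient(x[i-1]) * (x[i] - x[i-1])` is our reading of the verbal description, and the sine test function is chosen purely because its gradient is known in closed form.

```python
import math


def linear_approximation(f0, gradient, grid):
    """Approximate function values on `grid` by the assumed recursion
    f[i] = f[i-1] + gradient(x[i-1]) * (x[i] - x[i-1])."""
    values = [f0]  # the value at the first grid point is a given constant
    for i in range(1, len(grid)):
        step = grid[i] - grid[i - 1]
        values.append(values[-1] + gradient(grid[i - 1]) * step)
    return values


def make_grid(lo, hi, n):
    """Return n + 1 evenly spaced points from lo to hi, inclusive."""
    return [lo + (hi - lo) * i / n for i in range(n + 1)]


def max_error(n):
    """Largest absolute error when approximating sin on [0, pi] with n steps."""
    grid = make_grid(0.0, math.pi, n)
    approx = linear_approximation(math.sin(0.0), math.cos, grid)
    return max(abs(a - math.sin(x)) for a, x in zip(approx, grid))


# A finer grid should give a smaller worst-case error.
coarse_error = max_error(10)
fine_error = max_error(1000)
```

Comparing `coarse_error` with `fine_error` gives a quick numerical check of the claim that refining the grid improves the approximation.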
We are now ready to present our novel Bayesian scheme that explicitly connects the expected reward function with its gradient.
As shown in Fig. 3, we model the function value and the gradient at each grid point as stochastic variables. These stochastic variables are normally distributed, with corresponding unknown means and variances. As further seen in the figure, the relationships between the variables are defined recursively, according to the aforementioned linear approximation scheme, with an added i.i.d. Gaussian noise term that captures uncertainty. Furthermore, we model the dynamics of the unknown gradient by relating neighbouring stochastic gradient variables, so that the change in gradient from one grid point to the next is stochastically governed by i.i.d. Gaussian noise.
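To make the generative reading of this model concrete, the following sketch draws one joint sample of function and gradient values along the grid. It rests on assumptions that are not spelled out here: the function `sample_bugb_prior`, its parameter names, and the exact form of the two recursions (a linear step plus Gaussian noise for the function value, a Gaussian random walk for the gradient) are illustrative choices only.

```python
import random


def sample_bugb_prior(grid, f0, g0, sigma_f, sigma_g, rng):
    """Draw one sample of (function values, gradients) on `grid`.

    Assumed recursions (illustrative, not taken verbatim from the paper):
        F[i] = F[i-1] + G[i-1] * (x[i] - x[i-1]) + Gaussian noise (std sigma_f)
        G[i] = G[i-1] + Gaussian noise (std sigma_g)
    """
    f_values, g_values = [f0], [g0]
    for i in range(1, len(grid)):
        step = grid[i] - grid[i - 1]
        # g_values[-1] is still G[i-1] here, because G[i] is appended afterwards.
        f_values.append(f_values[-1] + g_values[-1] * step + rng.gauss(0.0, sigma_f))
        g_values.append(g_values[-1] + rng.gauss(0.0, sigma_g))
    return f_values, g_values


# With both noise levels set to zero the sample is the deterministic
# linear approximation: a straight line with slope g0 starting at f0.
line_f, line_g = sample_bugb_prior([0.0, 0.5, 1.0], 1.0, 2.0, 0.0, 0.0, random.Random(0))
```

Raising `sigma_g` lets the sampled gradient drift more between neighbouring grid points, which corresponds to a less smooth function.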
Using a factor graph based computation approach for the above model, we can efficiently calculate the posterior joint and marginal distributions for all the variables, given noisy information on both function and gradient values (the computational complexity grows linearly with the number of grid points).
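The linear cost can be illustrated with a swapped-in but closely related technique. For a chain-structured linear-Gaussian model, forward message passing coincides with a Kalman filter, so the sketch below uses a two-dimensional Kalman forward pass over the state (function value, gradient) instead of a factor graph library. The state-space form, the function names, and the observation encoding `(component, value, noise_variance)` are all assumptions. A backward (smoothing) pass, needed for marginals that take later observations into account, is omitted.

```python
def predict(mean, cov, step, q_f, q_g):
    """Move the Gaussian belief over (f, g) one grid step forward."""
    (a, b), (_, c) = cov
    new_mean = [mean[0] + step * mean[1], mean[1]]
    new_cov = [
        [a + 2.0 * step * b + step * step * c + q_f, b + step * c],
        [b + step * c, c + q_g],
    ]
    return new_mean, new_cov


def update(mean, cov, k, value, noise_var):
    """Condition on a noisy scalar observation of component k (0 = f, 1 = g)."""
    s = cov[k][k] + noise_var  # innovation variance
    gain = [cov[0][k] / s, cov[1][k] / s]
    residual = value - mean[k]
    new_mean = [mean[i] + gain[i] * residual for i in range(2)]
    new_cov = [[cov[i][j] - gain[i] * cov[k][j] for j in range(2)] for i in range(2)]
    return new_mean, new_cov


def forward_filter(grid, observations, prior_mean, prior_cov, q_f, q_g):
    """Return filtered (mean, cov) beliefs for every grid point in one pass.

    `observations` maps a grid index to a list of (k, value, noise_var) tuples.
    """
    mean, cov = list(prior_mean), [list(row) for row in prior_cov]
    beliefs = []
    for i in range(len(grid)):
        if i > 0:
            mean, cov = predict(mean, cov, grid[i] - grid[i - 1], q_f, q_g)
        for k, value, noise_var in observations.get(i, []):
            mean, cov = update(mean, cov, k, value, noise_var)
        beliefs.append((mean, cov))
    return beliefs


# Observe f = 1 and g = 2 almost exactly at the first grid point,
# then propagate the belief half a unit to the right.
beliefs = forward_filter(
    [0.0, 0.5],
    {0: [(0, 1.0, 1e-6), (1, 2.0, 1e-6)]},
    [0.0, 0.0],
    [[100.0, 0.0], [0.0, 100.0]],
    0.0,
    0.0,
)
```

Each grid point triggers a constant amount of work (one prediction plus one update per observation), which is where the linear growth in computation comes from in this sketch.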
2.2 Optimization Strategies with Thompson Sampling and Upper Confidence Intervals
From a rather broad perspective, there are currently two competing strategies for finding the global optimum in the bandit setting: Thompson sampling (stochastic probability matching schemes) and those based on upper confidence (or credibility) bounds (UCB). Thompson sampling tends to provide better performance than UCB-based approaches in empirical investigations, but is known to over-explore. UCB-like approaches, on the other hand, provide a deterministic and more goal-directed path towards the global optimum, finding the optimum with probability arbitrarily close to unity. Thompson sampling, in contrast, always converges to the global optimum (with unit probability).
In [5] we proposed a Bayesian technique for solving bandit like problems, akin to the Thompson Sampling [8] principle, leading to novel schemes for handling multi-armed and non-stationary (restless) bandit problems [9, 10]. Empirical results demonstrated the advantages of these techniques over established top performers. Furthermore, we provided theoretical results stating that the original technique is instantaneously self-correcting and that it converges to only pulling the optimal arm with probability as close to unity as desired. Later on, as a further testimony to the renewed importance of the Thompson Sampling principle, a modern Bayesian look at the multi-armed bandit problem was also taken in [6, 7].
A promising avenue for solving the multi-armed bandit problem is the family of methods based on confidence interval estimation: the scheme estimates a confidence interval for the reward probability of each arm, and an “optimistic reward probability estimate” is identified for each arm. The arm with the most optimistic reward probability estimate is then greedily selected [11, 12].
In [13], the authors analysed several confidence interval based algorithms. These algorithms also provide logarithmically increasing regret, with UCB-Tuned (a variant of the well-known UCB1 algorithm) outperforming UCB1, UCB2, and the $\epsilon$-greedy strategy. In brief, in UCB-Tuned, the following optimistic estimates are used for each arm $j$:
$$\mu_j + \sqrt{\frac{\ln n}{n_j} \min\left\{\frac{1}{4},\; \sigma_j^2 + \sqrt{\frac{2 \ln n}{n_j}}\right\}}$$
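As a point of reference, the Thompson Sampling principle for Bernoulli rewards can be sketched as follows. This is a generic textbook-style illustration, not the specific scheme from the cited papers. The uniform Beta(1, 1) priors, the function names, and the simulated reward probabilities are assumptions.

```python
import random


def thompson_select(successes, failures, rng):
    """Sample each arm's Beta posterior and return the arm with the largest draw."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_thompson(reward_probs, rounds, seed=0):
    """Simulate Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    arms = len(reward_probs)
    successes, failures, pulls = [0] * arms, [0] * arms, [0] * arms
    for _ in range(rounds):
        arm = thompson_select(successes, failures, rng)
        pulls[arm] += 1
        # Simulated environment: reward with the arm's (hidden) probability.
        if rng.random() < reward_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


pulls = run_thompson([0.1, 0.9], rounds=2000)
```

Because arm selection is driven by random posterior draws, inferior arms keep being tried occasionally, which is one way to picture the over-exploration mentioned above.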
where $\mu_j$ and $\sigma_j^2$ are the sample mean and variance of the rewards that have been obtained from arm $j$, $n$ is the total number of arm pulls, and $n_j$ is the number of times arm $j$ has been pulled. Thus, the quantity added to the sample average of a specific arm is steadily reduced as the arm is pulled, and the corresponding uncertainty about the reward probability is reduced. As a result, by always selecting the arm with the highest optimistic reward estimate, UCB-Tuned gradually shifts from exploration to exploitation.
By providing a Bayesian estimate of the function to be optimised, the BUG-B model supports both Thompson Sampling and UCB-based optimisation. However, as further explored below, we obtained the best performance by calculating 95% Bayesian credible bounds across the grid of input values. By iteratively measuring the function value at the grid point with the highest bound and then updating our estimate of the function using BUG-B, we were able to converge quickly to the maximum of the function.
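The UCB-Tuned index of Auer et al. can be computed as in the sketch below: the sample mean plus the square root of (ln n / n_j) times the smaller of 1/4 and the sample variance plus sqrt(2 ln n / n_j). The function and parameter names are our own, and pulling every untried arm once before using the index is an assumed initialisation step.

```python
import math


def ucb_tuned_index(mean, variance, pulls, total_pulls):
    """Optimistic estimate for one arm that has been pulled at least once."""
    log_n = math.log(total_pulls)
    variance_bound = variance + math.sqrt(2.0 * log_n / pulls)
    return mean + math.sqrt((log_n / pulls) * min(0.25, variance_bound))


def select_arm(means, variances, pulls, total_pulls):
    """Return the arm to pull next under UCB-Tuned."""
    for j, n in enumerate(pulls):
        if n == 0:  # assumed initialisation: try each arm once first
            return j
    indices = [
        ucb_tuned_index(means[j], variances[j], pulls[j], total_pulls)
        for j in range(len(means))
    ]
    return max(range(len(indices)), key=indices.__getitem__)


# Two arms with identical statistics: the less explored arm gets the larger bonus.
chosen = select_arm([0.5, 0.5], [0.1, 0.1], [90, 10], 100)
```

The exploration bonus shrinks as `pulls` grows relative to `total_pulls`, which reproduces the gradual shift from exploration to exploitation.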
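The selection step of this strategy can be sketched as follows, assuming that the posterior at each grid point is summarised by a mean and a standard deviation. The name `next_query_index` and the multiplier `z = 1.96` (the upper end of a two-sided 95% Gaussian interval) are assumptions. A one-sided 95% bound would use roughly 1.645 instead.

```python
def next_query_index(means, std_devs, z=1.96):
    """Return the index of the grid point with the highest upper credible bound."""
    bounds = [m + z * s for m, s in zip(means, std_devs)]
    return max(range(len(bounds)), key=bounds.__getitem__)


# The third point has a lower mean but is far more uncertain,
# so its upper bound is the largest and it is queried next.
explore = next_query_index([0.0, 1.0, 0.5], [0.1, 0.1, 1.0])

# Once uncertainty has collapsed everywhere, the highest mean wins.
exploit = next_query_index([0.0, 1.0, 0.5], [0.0, 0.0, 0.0])
```

In a full loop, the chosen grid point would be measured, the posterior recomputed, and the selection repeated.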
3 Empirical Results
In this section we evaluate the BUG-B scheme by comparing it with the currently best performing approaches. Based on our comparison with these “reference” algorithms, it should be quite straightforward to also relate the BUG-B performance results to the performance of other similar algorithms.
3.1 Experimental Setup
We have conducted numerous experiments using various functions, generating artificial data under varying degrees of observation noise. The full range of empirical results all show the same trend; here, we report performance on a representative subset of the experiment configurations, involving uni-modal and multi-modal functions, with varying degrees of noise and resolutions. Performance is measured in terms of regret: the difference between the sum of rewards expected after successive rounds and what would have been obtained by always selecting the optimal point.
For these experiment configurations, an ensemble of independent replications with different random number streams was performed to minimise the variance of the reported results. In order to investigate the performance of the schemes under a broad spectrum of environments, we test the schemes using three different representative functions, including one that is sloped, with a single maximum, and one that is more peaked, with multiple local maxima that are similar to the global maximum. To investigate performance under varying degrees of noise we introduced i.i.d. Gaussian observation noise, employing a diverse range of noise levels: 0.01, 0.1, 1.0, and 5.0. Regret is reported after 25, 50, 100, and 250 iterations for both the new accelerating scheme and the traditional static scheme.
3.2 Comparison of Regret
The regret measure is non-trivial, and so we provide further clarification here. In brief, the regret can be seen as the difference between the sum of rewards expected after successive arm pulls, and what would have been obtained by only pulling the optimal arm. To exemplify, assume that a reward yields a value (utility) of 1, and that a penalty is associated with the value 0. This implies that the expected utility of pulling an arm equals its reward probability.
Thus, the regret after a given number of plays becomes the reward probability of the optimal arm multiplied by the number of plays, minus the sum of the expected rewards at each arm pull, given the agent’s arm selection strategy.
Table 1 reports average regret over multiple functions for different numbers of time steps. As exemplified in Fig. 4, pursuing a UCB strategy, BUG-B only needs 4-5 observations to capture the underlying function, allowing it to quickly zoom in on the most promising input value regions. The effect of this is seen in the table, with BUG-B performing significantly better than the competing state-of-the-art schemes.
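As a small illustration of this measure, the sketch below accumulates regret play by play: at each play it adds the gap between the expected reward of the optimal choice and the expected reward of the choice actually made. The function name and the example numbers are hypothetical.

```python
def cumulative_regret(optimal_reward, expected_rewards):
    """Return the regret after each play.

    `expected_rewards[i]` is the expected reward of the choice made at play i.
    """
    total = 0.0
    regret = []
    for reward in expected_rewards:
        total += optimal_reward - reward
        regret.append(total)
    return regret


# Hypothetical run: the second play picks the optimal arm, so regret stays flat there.
regret_curve = cumulative_regret(0.9, [0.5, 0.9, 0.7])
```

A strategy that settles on the optimal choice produces a curve that flattens out, whereas persistent exploration keeps the curve rising.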
Also notice how BUG-B outperforms the Gaussian process based UCB approach, even when not receiving feedback on the gradient function. This could be explained by the ability of BUG-B to infer gradient information indirectly by means of the noisy function value observations.
| Algorithm / Time steps | 25 | 50 | 100 | 250 |
|---|---|---|---|---|
| BUG-B w/o gradient | 20.0 | 26.4 | 34.2 | 50.9 |
| Multi-armed Bandit w/UCB | 26.0 | 35.9 | 48.5 | 71.7 |
| Gaussian Process w/UCB | 19.9 | 27.1 | 36.3 | 53.9 |
The above findings are confirmed by the plots in Fig. 5, showing that BUG-B provides superior performance at every time step. The Gaussian process based approach is better than BUG-B without gradient feedback up to time step ten or so, and then BUG-B w/o gradients is slightly better for the remainder of the time steps.
Table 2 summarises performance under a diverse range of noise levels, from 0.01 up to 5.0. BUG-B is consistently the superior approach across all the noise levels, both with and without feedback on the gradient.
| Algorithm / Noise | 0.01 | 0.1 | 1.0 | 5.0 |
|---|---|---|---|---|
| BUG-B w/o gradient | 12.8 | 17.8 | 50.9 | 110.57 |
| Multi-armed Bandit w/UCB | 31.6 | 34.9 | 71.7 | 149.2 |
| Gaussian Process w/UCB | 14.4 | 19.6 | 53.9 | 175.6 |
Interestingly, gradient descent improves performance in the mid range noise levels. This can be explained by the increased noise making it possible to escape local optima; however, performance falls again with the largest degree of noise.
Table 3 summarises computational performance. In brief, the structure of the BUG-B model lends itself to efficient local computation. This leads to a linear increase in computation time with respect to the number of observations, as opposed to the much more computationally expensive Gaussian process based approach (with exact computation involving covariance matrix inversion).
|BUG-B||MAB||Gaussian Process||Gradient Descent|
For all of these experiments, the gradients of the functions were pre-calculated, making gradient descent computationally extremely efficient.
4 Conclusions and Further Work
In this paper we have proposed a novel approach to global optimisation where bandit based and gradient based learning are combined. Our Bayesian model, BUG-B, unifies the two paradigms in one integrated model. At the heart of the model we find a stochastic linear approximation of the function to be optimised. Here, both the gradient and function values are explicitly related. This allows us to learn from both noisy function and gradient observations, as well as to predict these properties across the action space to support optimisation.
We further proposed an accompanying bandit driven exploration scheme that uses Bayesian credibility bounds to trade off exploration against exploitation. Our empirical results demonstrated that by unifying bandit and gradient based learning, one obtains consistently improved performance across a wide spectrum of environments. Furthermore, even when gradient feedback is unavailable, the flexibility of our model, including gradient prediction, allows us to still outperform competing approaches, although with a smaller margin. Due to the pervasiveness of bandit based optimisation, our scheme opens up for improved performance both in meta-optimisation and in applications where gradient information is readily available.
In future work, we propose that these pioneering results be extended in a number of directions. First of all, BUG-B needs to be generalised to cover multi-dimensional functions. Additionally, formal regret bounds and asymptotic properties need to be established. Finally, it would be interesting to investigate how BUG-B can be leveraged in novel application areas, such as meta-learning in neural networks.
- [1] Y. Bengio, “Gradient-Based Optimization of Hyperparameters,” Neural Computation, vol. 12, pp. 1889–1900, 2000.
- [2] O. Atan, C. Tekin, and M. v.d. Schaar, “Global Multi-armed Bandits with Hölder Continuity,” in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015, pp. 28–36.
- [3] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvari, “X-Armed Bandits,” Journal of Machine Learning Research, vol. 12, pp. 1655–1695, 2011.
- [4] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, “Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting,” IEEE Transactions on Information Theory, vol. 58, pp. 3250–3265, 2012.
- [5] O.-C. Granmo, “Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton,” International Journal of Intelligent Computing and Cybernetics, vol. 3, no. 2, pp. 207–234, 2010.
- [6] S. L. Scott, “A modern Bayesian look at the multi-armed bandit,” Applied Stochastic Models in Business and Industry, no. 26, pp. 639–658, 2010.
- [7] B. C. May, N. Korda, A. Lee, and D. S. Leslie, “Optimistic Bayesian sampling in contextual-bandit problems,” Submitted to the Annals of Applied Probability, 2011.
- [8] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, pp. 285–294, 1933.
- [9] T. Norheim, T. Brådland, O.-C. Granmo, and B. J. Oommen, “A Generic Solution to Multi-Armed Bernoulli Bandit Problems Based on Random Sampling from Sibling Conjugate Priors,” in Proceedings of the Second International Conference on Agents and Artificial Intelligence (ICAART 2010). INSTICC, 2010, pp. 36–44.
- [10] O.-C. Granmo and S. Berg, “Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters,” in Proceedings of the Twenty Third International Conference on Industrial, Engineering, and Other Applications of Applied Intelligent Systems (IEA-AIE 2010). Springer, 2010, pp. 199–208.
- [11] J. Vermorel and M. Mohri, “Multi-armed bandit algorithms and empirical evaluation,” in Proceedings of ECML 2005. Springer, 2005, pp. 437–448.
- [12] L. P. Kaelbling, “Learning in embedded systems,” Ph.D. dissertation, Stanford University, 1993.
- [13] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time Analysis of the Multiarmed Bandit Problem,” Machine Learning, vol. 47, pp. 235–256, 2002.