Bayesian Optimisation over Multiple Continuous and Categorical Inputs
Abstract
Efficient optimisation of black-box problems that comprise both continuous and categorical inputs is important, yet poses significant challenges. We propose a new approach, Continuous and Categorical Bayesian Optimisation (CoCaBO), which combines the strengths of multi-armed bandits and Bayesian optimisation to select values for both categorical and continuous inputs. We model this mixed-type space using a Gaussian Process kernel, designed to allow sharing of information across multiple categorical variables, each with multiple possible values; this allows CoCaBO to leverage all available data efficiently. We extend our method to the batch setting and propose an efficient selection procedure that dynamically balances exploration and exploitation whilst encouraging batch diversity. We demonstrate empirically that our method outperforms existing approaches on both synthetic and real-world optimisation tasks with continuous and categorical inputs.
Binxin Ru† (Department of Engineering Science, University of Oxford, robin@robots.ox.ac.uk), Ahsan S. Alvi† (Department of Engineering Science, University of Oxford, asa@robots.ox.ac.uk), Vu Nguyen (Department of Engineering Science, University of Oxford, vu@robots.ox.ac.uk), Michael A. Osborne (Department of Engineering Science, University of Oxford, mosb@robots.ox.ac.uk), Stephen J. Roberts (Department of Engineering Science, University of Oxford, sjrob@robots.ox.ac.uk). † These authors contributed equally.
Preprint. Under review.
1 Introduction
Existing work has shown Bayesian optimisation (BO) to be remarkably successful at optimising functions with continuous input spaces [29, 17, 18, 26, 28, 12, 1]. However, in many situations, optimisation problems involve a mixture of continuous and categorical variables. For example, with a deep neural network, we may want to adjust the learning rate and the number of units in each layer (continuous), as well as the activation function type in each layer (categorical). Similarly, in a gradient boosting ensemble of decision trees, we may wish to adjust the learning rate and the maximum depth of the trees (both continuous), as well as the boosting algorithm and loss function (both categorical).
Having a mixture of categorical and continuous variables presents unique challenges. If some inputs are categorical rather than continuous, then the common assumption that the BO acquisition function is continuous and differentiable over the input space, which allows the acquisition function to be optimised efficiently, no longer holds. Recent research has dealt with categorical variables in different ways. The simplest approach for BO with Gaussian process (GP) surrogates is to apply a one-hot encoding to the categorical variables so that they can be treated as continuous, and to perform BO on the transformed space [4]. Alternatively, the mixed-type inputs can be handled using a hierarchical structure, such as random forests [19, 5] or multi-armed bandits (MABs) [16]. These approaches come with their own challenges, which we discuss below (see Section 3). In particular, the existing approaches are not well designed for multiple categorical variables, each with multiple possible values. Additionally, no GP-based BO methods have, to the best of our knowledge, explicitly considered the batch setting for continuous-categorical inputs.
In this paper, we present a new Bayesian optimisation approach for optimising a black-box function with multiple continuous and categorical inputs, termed Continuous and Categorical Bayesian Optimisation (CoCaBO). Our approach is motivated by the success of MABs [2, 3] in identifying the best value(s) from a discrete set of options.
Our main contributions are as follows:

We propose a novel method which combines the strengths of MABs and BO to optimise black-box functions with multiple categorical and continuous inputs (Section 4.1).

We present a GP kernel to capture complex interactions between the continuous and categorical inputs (Section 4.2). Our kernel allows sharing of information across different categories without resorting to one-hot transformations.

We introduce a novel batch selection method for mixed input types that extends CoCaBO to the parallel setting, dynamically balancing exploration and exploitation while encouraging batch diversity (Section 4.3).

We demonstrate the effectiveness of our methods on a variety of synthetic and real-world optimisation tasks with multiple categorical and continuous inputs (Section 5).
2 Preliminaries
In this paper, we consider the problem of optimising a black-box function whose input consists of both continuous and categorical variables, $\mathbf{z} = [\mathbf{h}, \mathbf{x}]$, where $\mathbf{h} = [h_1, \dots, h_c]$ are the categorical variables, with each $h_j$ taking one of $N_j$ different values, and $\mathbf{x}$ is a point in a $d_x$-dimensional hypercube $\mathcal{X} = [0, 1]^{d_x}$. Formally, we aim to find the best configuration $\mathbf{z}^*$ to maximise the black-box function
$\mathbf{z}^* = \arg\max_{\mathbf{z} \in \mathcal{Z}} f(\mathbf{z})$   (1)
by making a series of evaluations $f(\mathbf{z}_1), \dots, f(\mathbf{z}_T)$. Later we extend our method to allow parallel evaluation of multiple points, by selecting a batch of $b$ points at each optimisation step $t$.
Bayesian optimisation [7, 28] is an approach for optimising a black-box function such that its optimal value is found using a small number of evaluations. BO often uses a Gaussian process [25] surrogate to model the objective $f$. A GP defines a probability distribution over functions as $f(\mathbf{z}) \sim \mathcal{GP}(m(\mathbf{z}), k(\mathbf{z}, \mathbf{z}'))$, where $m(\cdot)$ and $k(\cdot, \cdot)$ are the mean and covariance functions respectively, which encode our prior beliefs about $f$. Using the GP posterior, BO defines an acquisition function $\alpha(\cdot)$, which is optimised to identify the next location $\mathbf{z}_{t+1}$ to sample. Unlike the original objective function $f$, the acquisition function is cheap to compute and can be optimised using standard techniques.
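As a concrete illustration of the loop described above, the following is a minimal sketch of GP posterior inference and a UCB-style acquisition step. The RBF kernel, fixed lengthscale, noise level and candidate grid are illustrative choices for this sketch only, not the paper's implementation:

```python
import numpy as np

def rbf(A, B, ls=0.2):
    # Unit-variance squared-exponential kernel between rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Standard GP regression posterior mean and variance at test points Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - (Ks * np.linalg.solve(K, Ks)).sum(0)  # prior variance is 1
    return mu, np.maximum(var, 1e-12)

def ucb_next(X, y, candidates, kappa=2.0):
    # Acquisition step: pick the candidate maximising mu + kappa * sigma.
    mu, var = gp_posterior(X, y, candidates)
    return candidates[np.argmax(mu + kappa * np.sqrt(var))]
```

In practice the acquisition is maximised with a gradient-based optimiser rather than a candidate grid, but the structure of the step is the same.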
3 Related Work
3.1 One-hot encoding
A common method for dealing with categorical variables is to transform them into a one-hot encoded representation, where a variable with $N$ choices is transformed into a vector of length $N$ with a single non-zero element. This is the approach followed by popular BO packages like Spearmint [29] and GPyOpt [15, 4].
There are two main drawbacks to this approach. First, the commonly-used RBF (squared exponential, radial basis function) and Matérn kernels in the GP surrogate assume that $f$ is continuous and differentiable in the input space, which is clearly not the case for one-hot encoded variables, as the objective is only defined on a small subspace within this representation.
The second drawback is that the acquisition function is optimised as a continuous function. By using this extended representation, we turn the optimisation into a significantly harder problem due to the increased dimensionality of the search space. Additionally, the one-hot encoding makes our problem sparse, especially when we have multiple categorical variables, each with multiple choices. This causes distances between inputs to become large, reducing the usefulness of the surrogate at such locations. As a result, the optimisation landscape is characterised by many flat regions, making it difficult to optimise [24].
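To make the dimensionality blow-up concrete, the one-hot transformation can be sketched as follows; the variable counts here are illustrative, not taken from any experiment in the paper:

```python
import numpy as np

def one_hot(h, num_values):
    # Encode one categorical value h in {0, ..., num_values - 1}
    # as a length-num_values vector with a single non-zero entry.
    v = np.zeros(num_values)
    v[h] = 1.0
    return v

def encode(hs, Ns):
    # Concatenate the one-hot encodings of several categorical variables.
    return np.concatenate([one_hot(h, N) for h, N in zip(hs, Ns)])

# Three categorical variables with 4, 5 and 6 values inflate the
# categorical part of the search space from 3 to 4 + 5 + 6 = 15 dimensions,
# of which only 3 entries are ever non-zero.
z = encode([2, 0, 5], [4, 5, 6])
```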
3.2 Hierarchical approaches
Random forests (RFs) [6] can naturally handle continuous and categorical variables, and are used in SMAC [19] as the underlying surrogate model for $f$. However, the predictive distribution of the RF, which is used to select the next evaluation, is less reliable, as it depends on the randomness introduced by the bootstrap samples and by the random subsets of variables tested at each split node. Moreover, RFs can easily overfit, so the number of trees must be chosen carefully. Another tree-based approach is the Tree-structured Parzen Estimator (TPE) [5], an optimisation algorithm based on tree-structured Parzen density estimators. TPE uses non-parametric Parzen kernel density estimators to model the distributions of good and bad configurations w.r.t. a reference value. Due to the nature of kernel density estimators, TPE also supports continuous and discrete spaces.
Another, more recent, approach is EXP3BO [16], which deals with mixed categorical and continuous input spaces by utilising a MAB. When a categorical value is selected by the MAB, EXP3BO constructs a GP surrogate specific to the chosen category for modelling the continuous domain, i.e. it shares no information across the different categories. The observed data are divided into smaller subsets, one per category; as a result, EXP3BO can handle only a small number of categorical choices and requires a large number of samples.
4 Continuous and Categorical Bayesian Optimisation (CoCaBO)
4.1 CoCaBO Acquisition Procedure
Our proposed method, Continuous and Categorical Bayesian Optimisation, harnesses both the advantages of multi-armed bandits in selecting categorical inputs and the strength of GP-based BO in optimising continuous input spaces. The CoCaBO procedure is shown in Algorithm 1. CoCaBO first decides the values $\mathbf{h}_t$ of the categorical inputs using a MAB (Step 4 in Algorithm 1). Given $\mathbf{h}_t$, it then maximises the acquisition function to select the continuous part $\mathbf{x}_t$, which together form the next point $\mathbf{z}_t = [\mathbf{h}_t, \mathbf{x}_t]$ for evaluation, as illustrated in Figure 1.
For the MAB, we chose the EXP3 [3] method because it makes comparatively few assumptions about the reward distributions and can be used under more general conditions, unlike, for example, UCB and $\epsilon$-greedy, which assume i.i.d. rewards. See e.g. [2] for a review of MAB methods. For our procedure, we define the MAB's reward for each category as the best function value observed so far from that category. Since the best-so-far statistic is not independent across iterations, the reward distribution is not i.i.d.
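For reference, the EXP3 update can be sketched as follows. The exploration parameter `gamma` and the assumption that rewards lie in [0, 1] are illustrative choices for this sketch, not the paper's implementation:

```python
import numpy as np

class EXP3:
    # Adversarial multi-armed bandit (Auer et al., 2002): exponential
    # weights with an importance-weighted reward estimate.
    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.w = np.ones(n_arms)
        self.gamma = gamma
        self.rng = np.random.default_rng(seed)

    def probs(self):
        # Mixture of the normalised weights and a uniform exploration floor.
        n = len(self.w)
        return (1 - self.gamma) * self.w / self.w.sum() + self.gamma / n

    def select(self):
        return self.rng.choice(len(self.w), p=self.probs())

    def update(self, arm, reward):
        # Only the pulled arm is updated; rewards assumed scaled to [0, 1].
        xhat = reward / self.probs()[arm]
        self.w[arm] *= np.exp(self.gamma * xhat / len(self.w))
```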
1: Input: a black-box function $f$, observation data $\mathcal{D}_0$, maximum number of iterations $T$
2: Output: the best recommendation $\mathbf{z}^*$
3: for $t = 1, \dots, T$ do
4:   Select $\mathbf{h}_t \leftarrow$ EXP3($\mathcal{D}_{t-1}$)
5:   Select $\mathbf{x}_t \leftarrow \arg\max_{\mathbf{x}} \alpha(\mathbf{x} \mid \mathbf{h}_t, \mathcal{D}_{t-1})$
6:   Query $f$ at $\mathbf{z}_t = [\mathbf{h}_t, \mathbf{x}_t]$ to obtain $y_t$
7:   $\mathcal{D}_t \leftarrow \mathcal{D}_{t-1} \cup \{(\mathbf{z}_t, y_t)\}$
8: end for
By using the MAB to decide the values of the categorical inputs, we only need to optimise the acquisition function over the continuous subspace $\mathcal{X}$. In comparison to one-hot based methods, whose acquisition functions are defined over the full one-hot-encoded space, our approach enjoys a significant reduction in the difficulty and cost of optimising the acquisition function.^1

^1 To optimise the acquisition function to within $\epsilon$ accuracy using a grid search or branch-and-bound optimiser, our approach requires only $\mathcal{O}(\epsilon^{-d_x})$ calls [20], while one-hot approaches require $\mathcal{O}(\epsilon^{-(d_x + \sum_j N_j)})$ calls. The cost saving grows exponentially with the number of categories and the number of choices for each category.
In Figure 2, we demonstrate the effectiveness of our approach in dealing with categorical variables via a simple synthetic example, Func2C (described in Section 5.1), which comprises two categorical inputs, $h_1$ and $h_2$, and two continuous inputs. The optimal function value lies in the subspace selected by one particular combination of the two categorical variables. The categories chosen by CoCaBO at each iteration, the histogram of all selections, and the rewards for each category are shown for 200 iterations. We can see that CoCaBO successfully identifies and focuses on the correct categories.
4.2 CoCaBO kernel design
We propose to use a combination of two separate kernels: our kernel $k_z$ combines a kernel $k_h$ defined over the categorical inputs with a kernel $k_x$ defined over the continuous inputs.
For the categorical kernel, we propose using an indicator-based similarity metric, $k_h(\mathbf{h}, \mathbf{h}') = \frac{\sigma^2}{c} \sum_{j=1}^{c} \mathbb{1}[h_j = h'_j]$, where $\sigma^2$ is the kernel variance and $\mathbb{1}[h_j = h'_j] = 1$ if $h_j = h'_j$ and is zero otherwise. This kernel can be derived as a special case of an RBF kernel, which is explored in Appendix B.
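The indicator-based categorical kernel can be sketched directly; the normalisation by the number of categorical variables and the variance parameter follow the formula above:

```python
import numpy as np

def categorical_kernel(h1, h2, variance=1.0):
    # Overlap kernel: (variance / c) * sum_j 1[h1_j == h2_j],
    # i.e. the fraction of categorical variables on which the two
    # inputs agree, scaled by the kernel variance.
    h1, h2 = np.asarray(h1), np.asarray(h2)
    return variance * np.mean(h1 == h2)
```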
There are several ways of combining kernels that result in valid kernels [11]. One approach is to sum them. Summing kernels that are each defined over a different subset of the input space has been used successfully for BO in the context of high-dimensional optimisation [20]. Simply adding the continuous kernel $k_x$ to the categorical kernel $k_h$, though, provides limited expressiveness, as in practice it amounts to learning a single common trend over the continuous space plus an offset depending on the categorical values.
An alternative approach is to use the product $k_h \times k_x$. This form allows the kernel to encode couplings between the continuous and categorical domains, capturing a richer set of relationships. However, if two points share no matching categories, which is likely in early iterations of BO, the product kernel between them is zero, preventing the model from sharing information across such points.
We therefore propose the CoCaBO kernel, which automatically exploits the strengths and avoids the weaknesses of the sum and product kernels via a trade-off parameter that can be optimised jointly with the GP hyperparameters (see Appendix C):
$k_z(\mathbf{z}, \mathbf{z}') = (1 - \lambda)\left(k_h(\mathbf{h}, \mathbf{h}') + k_x(\mathbf{x}, \mathbf{x}')\right) + \lambda\, k_h(\mathbf{h}, \mathbf{h}')\, k_x(\mathbf{x}, \mathbf{x}')$   (2)
where $\lambda \in [0, 1]$ is a hyperparameter controlling the relative contribution of the sum and product kernels.
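The mixture kernel in Eq. 2 can be sketched as follows; the RBF form and lengthscale used for the continuous part are illustrative placeholders, not the Matérn kernel used in the experiments:

```python
import numpy as np

def cocabo_kernel(h1, x1, h2, x2, lam=0.5, ls=0.2):
    # (1 - lam) * (k_h + k_x) + lam * (k_h * k_x), with lam in [0, 1].
    # lam = 0 recovers the sum kernel, lam = 1 the product kernel.
    k_h = np.mean(np.asarray(h1) == np.asarray(h2))          # overlap kernel
    sqdist = np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)
    k_x = np.exp(-0.5 * sqdist / ls ** 2)                    # RBF placeholder
    return (1 - lam) * (k_h + k_x) + lam * k_h * k_x
```

Note how, for identical inputs, both the sum and product terms contribute, while for points with no matching categories only the continuous sum term survives.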
It is worth highlighting a key benefit of our formulation over the alternative hierarchical methods discussed in Section 3.2: rather than dividing the data into a subset for each combination of categories, we leverage all of the acquired data at every stage of the optimisation. Our kernel combines information from data within the same category as well as from different categories, which improves its modelling performance. We compare the regression performance of the CoCaBO kernel and a one-hot encoded kernel on synthetic functions in Section 5.1.1.
4.3 Batch CoCaBO
1: Input: surrogate data $\mathcal{D}_{t-1}$, batch size $b$
2: Output: the batch $\mathbf{B}_t$
3: $\mathbf{H}_t \leftarrow$ EXP3.M($\mathcal{D}_{t-1}$, $b$)
4: $\{(\tilde{\mathbf{h}}_j, b_j)\} \leftarrow$ the unique categorical values in $\mathbf{H}_t$ and their counts
5: Initialise $\mathbf{B}_t \leftarrow \emptyset$ and $\tilde{\mathcal{D}} \leftarrow \mathcal{D}_{t-1}$
6: for each unique value $\tilde{\mathbf{h}}_j$ do
7:   Select $b_j$ continuous points for $\tilde{\mathbf{h}}_j$ via Kriging Believer, given $\tilde{\mathcal{D}}$
8:   Add the new points to $\mathbf{B}_t$ and their believer values to $\tilde{\mathcal{D}}$
9: end for
10: return $\mathbf{B}_t$
Our focus on optimising computer simulations and modelling pipelines provides a strong motivation to extend CoCaBO to select and evaluate multiple points at each iteration, in order to better utilise available hardware resources [29, 31, 27, 9].
The batch CoCaBO algorithm uses the “multiple plays” formulation of EXP3, called EXP3.M [3], which returns a batch of categorical choices, and combines it with the Kriging Believer (KB) [14] batch method to select the batch points in the continuous domain.^2 We choose KB for batch creation because it can take already-selected batch points into account, including those with different categorical values, without the strong assumptions that other popular techniques make; local penalisation [15, 1], for example, assumes that $f$ is Lipschitz continuous. Our novel contribution is a method for combining the batch points selected by EXP3.M with batch BO procedures for continuous input spaces. Assume we are selecting a batch of $b$ points at iteration $t$. A simple approach is to select a batch of $b$ categorical values and then choose a corresponding continuous value for each, as in the sequential algorithm above, thus forming the batch. However, such a batch may not contain $b$ unique locations, as some categorical values may be repeated. This is even more problematic when the number of possible combinations of the categorical variables is smaller than the batch size $b$, as we could then never obtain a fully unique batch.

^2 Note that our approach can easily utilise other batch selection techniques if desired.
Our batch selection method, outlined in Algorithm 2, allows us to create a batch of unique choices by allocating multiple continuous batch points to more desirable categories.
The key idea is to first collect all of the unique categorical choices proposed by the MAB, together with how often each occurs. These counts define how many continuous batch points will be selected for each categorical choice: for each unique categorical value, we select a number of continuous points equal to its number of occurrences in the MAB batch.
This is illustrated in Figure 3 for two possible scenarios. The benefit of using KB here is that the algorithm can take into account selections across the different to impose diversity in the batch in a consistent manner.
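The counting step above can be sketched with a standard multiset count; the batch contents here are illustrative:

```python
from collections import Counter

def allocate_batch(mab_batch):
    # Group the bandit's categorical picks: each unique choice receives as
    # many continuous batch points as the number of times it was proposed.
    return dict(Counter(map(tuple, mab_batch)))

# e.g. EXP3.M proposes categorical tuples [(0,1), (0,1), (2,0), (1,1)]
# for a batch of size 4: two continuous points are then optimised under
# (0,1), and one each under (2,0) and (1,1).
alloc = allocate_batch([(0, 1), (0, 1), (2, 0), (1, 1)])
```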
5 Experiments
We compared CoCaBO against a range of existing methods which are able to handle problems with mixed-type inputs: GP-based Bayesian optimisation with one-hot encoding (One-hot BO) [4], SMAC [19] and TPE [5]. For all the baseline methods, we used their publicly available Python packages.^3 CoCaBO and One-hot BO both use the UCB acquisition function [30] with a fixed scale parameter. We did not compare against EXP3BO [16] because we focus on optimisation problems involving multiple categorical inputs with multiple possible values, whereas EXP3BO can handle only one categorical input with few possible values, as discussed in Section 3.2.

^3 One-hot BO: https://github.com/SheffieldML/GPyOpt, SMAC: https://github.com/automl/pysmac, TPE: https://github.com/hyperopt/hyperopt
In all experiments, we tested four different settings of $\lambda$ for our method^4: $\lambda \in \{0, 0.5, 1.0, \text{auto}\}$, where auto means $\lambda$ is optimised as a hyperparameter. This leads to four variants of our method: CoCaBO-0.0, CoCaBO-0.5, CoCaBO-1.0 and CoCaBO-auto. We used a Matérn-5/2 kernel for $k_x$, as well as for One-hot BO, and used the indicator-based kernel discussed in Section 4.2 for $k_h$. For both our method and One-hot BO, we optimised the GP hyperparameters by maximising the log marginal likelihood every 10 iterations using multi-started gradient descent; see Appendix C for more details.

^4 Implementation will be made available via a GitHub repository.
We tested all these methods on a diverse set of synthetic and real problems in both sequential and batch settings. TPE is used only in the sequential setting because its package, HyperOpt, does not provide a synchronous batch implementation. For all problems, the continuous inputs were normalised to $[0, 1]$, and we started each optimisation method with the same set of random initial points. We ran each sequential optimisation for a fixed budget of iterations and each batch optimisation for correspondingly fewer iterations. Due to space constraints, the sequential experimental results are provided in Appendix E. All experiments were conducted on a 36-core 2.3 GHz Intel Xeon processor with 512 GB RAM.
5.1 Synthetic experiments
We tested the different methods on a number of synthetic functions. Func2C is a test problem with two continuous inputs and two categorical inputs. The categorical inputs control a linear combination of three two-dimensional global optimisation benchmark functions: Beale, Six-Hump Camel and Rosenbrock. This is also the test function used for the illustration in Figure 2. Func3C is similar to Func2C but with three categorical inputs, which leads to more complicated combinations of the three functions.
To test the performance of CoCaBO on problems with larger numbers of categorical inputs and/or inputs with larger numbers of categorical choices, we generated another series of synthetic functions, Ackley-cC, by converting $c$ of the dimensions of the Ackley function into categorical inputs, each with multiple possible values. A detailed description of these synthetic functions is provided in Appendix D.1.
5.1.1 Predictive performance of the CoCaBO posterior
We first investigate the quality of the CoCaBO surrogate by comparing its modelling performance against a standard GP with one-hot encoding. We train each model on randomly sampled data points and evaluate the predictive log likelihood on held-out test points. The mean and standard error over repeated random initialisations are presented in Table 1. The results showcase the benefit of the CoCaBO kernel over a kernel with one-hot encoded inputs, especially as the number of categorical inputs grows: the CoCaBO kernel allows the surrogate to learn a richer set of variations from the data, leading to consistently better out-of-sample predictions.
          Func2C       Func3C       Ackley2C      Ackley3C      Ackley4C      Ackley5C
CoCaBO    531 ± 260    435 ± 85.7   74.7 ± 9.42   47.2 ± 9.20   28.3 ± 13.7   23.5 ± 5.50
One-hot   254 ± 98.0   748 ± 42.4   77.9 ± 14.2   73.4 ± 18.3   59.8 ± 18.0   7.98 ± 12.5
5.1.2 Optimisation performance of CoCaBO on synthetic test functions
We evaluated the optimisation performance of our proposed CoCaBO methods and the existing methods on Func2C, Func3C and Ackley5C. The mean and standard error over repeated random runs in the batch setting are presented in Figure 4. The results in the sequential setting are included in Appendix E.1. In both settings, the CoCaBO methods outperform all other competing approaches on these synthetic problems, with CoCaBO-auto demonstrating the best overall performance. We note that CoCaBO outperformed One-hot BO on the Func2C optimisation task despite its surrogate performing worse in the prediction experiment in Table 1, which we attribute to the strength of CoCaBO in selecting the right categorical values.
5.2 Real-world experiments
Now we move to experiments on real-world tasks of hyperparameter tuning for machine learning algorithms. The first task (SVM-Boston) outputs the negative mean squared test error of a support vector machine (SVM) regressor on the Boston housing dataset [10]. The second task (XG-MNIST) returns the classification test accuracy of an XGBoost model [8] on MNIST [22]. The third problem (NN-Yacht) returns the negative log likelihood of a one-hidden-layer neural network regressor on the test set of the Yacht Hydrodynamics dataset [10].^5 A brief summary of the categorical and continuous inputs for these problems is given in Table 2, and a more detailed description of the inputs and implementation details is provided in Appendix D.2.

^5 We follow the implementation in https://github.com/yaringal/DropoutUncertaintyExps
The mean and standard error of the optimisation performance over repeated random runs in the batch setting are presented in Figure 5. The CoCaBO methods again show superior performance over the other batch methods on these real-world problems. On the XG-MNIST task, where all the categorical inputs have only binary choices, SMAC performs well but is still overtaken by CoCaBO. We note that, although CoCaBO-auto remains very competitive, the strong performance of the additive variant of CoCaBO suggests independence between the categorical and continuous input spaces in these real-world tasks, making the additive kernel structure sufficient.
SVM-Boston                       NN-Yacht                           XG-MNIST
kernel type (categorical)        activation type (categorical)      booster type (categorical)
kernel coefficient (categorical) optimiser type (categorical)       grow policy (categorical)
using shrinking (categorical)    suggested dropout (categorical)    training objective (categorical)
penalty parameter (continuous)   learning rate (continuous)         learning rate, regularisation (continuous)
tolerance for stopping (cont.)   number of neurons (continuous)     maximum depth, subsample (continuous)
model complexity (continuous)    aleatoric variance (continuous)    minimum split loss (continuous)
6 Conclusion
Existing BO literature uses one-hot transformations or hierarchical approaches to encode real-world problems involving mixed continuous and categorical inputs. We presented a solution from a novel perspective, called Continuous and Categorical Bayesian Optimisation (CoCaBO), that harnesses the strengths of multi-armed bandits and GP-based BO to tackle this problem. Our method uses a new kernel structure, which allows us to capture information within categories as well as across different categories. This leads to more efficient use of the acquired data and improved modelling power. We extended CoCaBO to the batch setting, enabling parallel evaluations at each stage of the optimisation. CoCaBO demonstrated strong performance over existing methods on a variety of synthetic and real-world optimisation tasks with multiple continuous and categorical inputs. We find CoCaBO to offer a very competitive alternative to existing approaches.
References
 [1] Ahsan S Alvi, Binxin Ru, Jan Calliess, Stephen J Roberts, and Michael A Osborne. Asynchronous batch Bayesian optimisation with improved local penalisation. arXiv preprint arXiv:1901.10452, 2019.
 [2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
 [3] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
 [4] The GPyOpt authors. GPyOpt: A Bayesian optimization framework in python. http://github.com/SheffieldML/GPyOpt, 2016.
 [5] J S Bergstra, R Bardenet, Y Bengio, and B Kégl. Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.
 [6] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 [7] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
 [8] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
 [9] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer, 2013.
 [10] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
 [11] David Duvenaud, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning, pages 1166–1174, 2013.
 [12] Peter I Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
 [13] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016.
 [14] David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging is wellsuited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pages 131–162. Springer, 2010.
 [15] Javier González, Zhenwen Dai, Philipp Hennig, and Neil D Lawrence. Batch Bayesian optimization via local penalization. In International Conference on Artificial Intelligence and Statistics, pages 648–657, 2016.
 [16] Shivapratap Gopakumar, Sunil Gupta, Santu Rana, Vu Nguyen, and Svetha Venkatesh. Algorithmic assurance: An active approach to algorithmic testing using Bayesian optimisation. In Advances in Neural Information Processing Systems, pages 5465–5473, 2018.
 [17] Philipp Hennig and Christian J Schuler. Entropy search for informationefficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.
 [18] José Miguel HernándezLobato, Michael Gelbart, Matthew Hoffman, Ryan Adams, and Zoubin Ghahramani. Predictive entropy search for Bayesian optimization with unknown constraints. In International Conference on Machine Learning, pages 1699–1707, 2015.
 [19] Frank Hutter, Holger H Hoos, and Kevin LeytonBrown. Sequential modelbased optimization for general algorithm configuration. In Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
 [20] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pages 295–304, 2015.
 [21] Brian Kulis and Michael I Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. arXiv preprint arXiv:1111.0352, 2011.
 [22] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
 [23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [24] Santu Rana, Cheng Li, Sunil Gupta, Vu Nguyen, and Svetha Venkatesh. High dimensional Bayesian optimization with elastic Gaussian process. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2883–2891, 2017.
 [25] C E Rasmussen and C K I Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 [26] Binxin Ru, Michael Osborne, and Mark McLeod. Fast informationtheoretic Bayesian optimisation. In International Conference on Machine Learning, 2018.
 [27] Amar Shah and Zoubin Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3312–3320, 2015.
 [28] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
 [29] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 [30] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, 2010.
 [31] Jian Wu and Peter I. Frazier. The parallel knowledge gradient method for batch Bayesian optimization. In NIPS, 2016.
Appendix A Notation summary
Notation                    Type           Meaning
$f$                         function       black-box objective function
$\mathbf{z} = [\mathbf{h}, \mathbf{x}]$  vector  combined categorical-continuous input
$\mathcal{X} = [0, 1]^{d_x}$  search domain  continuous search space, where $d_x$ is its dimension
$d_x$                       scalar         dimension of the continuous variable
$c$                         scalar         number of categorical variables
$\mathbf{x}_t$              vector         continuous selection made by BO at iteration $t$
$N_j$                       scalar         number of choices for categorical variable $h_j$
$\mathbf{h}$                vector         vector of categorical variables
$\mathcal{D}_t$             set            observation set $\{(\mathbf{z}_i, y_i)\}_{i=1}^{t}$
Appendix B Categorical kernel relation with RBF
In this section we discuss the relationship between our proposed categorical kernel and the RBF kernel. Our categorical kernel is reproduced here for ease of reference:

$k_h(\mathbf{h}, \mathbf{h}') = \frac{\sigma^2}{c} \sum_{j=1}^{c} \mathbb{1}[h_j = h'_j]$   (3)
Apart from the intuitive argument that this kernel allows us to model the degree of similarity between two categorical selections, it can also be derived as a special case of an RBF kernel. Consider the standard RBF kernel with unit variance evaluated between two scalar locations $h$ and $h'$:

$k(h, h') = \exp\left(-\frac{(h - h')^2}{2\ell^2}\right)$   (4)
The lengthscale $\ell$ in Eq. 4 defines the similarity between the two inputs: as the lengthscale becomes smaller, the distance between locations that would be considered similar (i.e. have high covariance) shrinks. In the limiting case $\ell \to 0$, two inputs that are not exactly equal provide no information for inferring the GP posterior's value at each other's locations. This turns the kernel into an indicator function, as in Eq. 3 above [21]:

$\lim_{\ell \to 0} k(h, h') = \mathbb{1}[h = h']$   (5)
By adding one such RBF kernel, with $\ell \to 0$, for each categorical variable in $\mathbf{h}$ and normalising the output, we arrive at the form in Eq. 3.
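This limiting behaviour is easy to verify numerically; the category indices and lengthscales below are illustrative:

```python
import numpy as np

def rbf1d(h, h2, ls):
    # Unit-variance RBF kernel between two scalar category indices.
    return np.exp(-0.5 * (h - h2) ** 2 / ls ** 2)

# Shrinking the lengthscale drives the covariance between *different*
# categories to zero, while k(h, h) stays at 1 -- the indicator kernel.
for ls in (1.0, 0.1, 0.01):
    same, diff = rbf1d(2.0, 2.0, ls), rbf1d(2.0, 3.0, ls)
```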
Appendix C Learning the hyperparameters in the CoCaBO kernel
We present the derivative used for estimating the trade-off parameter $\lambda$ in our CoCaBO kernel:

$\frac{\partial k_z}{\partial \lambda} = k_h(\mathbf{h}, \mathbf{h}')\, k_x(\mathbf{x}, \mathbf{x}') - \left(k_h(\mathbf{h}, \mathbf{h}') + k_x(\mathbf{x}, \mathbf{x}')\right)$   (6)
The hyperparameters of the kernel are optimised by maximising the log marginal likelihood (LML) of the GP surrogate

$\hat{\theta} = \arg\max_{\theta} \log p(\mathbf{y} \mid \theta)$   (7)
where we collected the hyperparameters of both kernels, as well as the CoCaBO trade-off parameter $\lambda$, into $\theta$. The LML and its derivative are defined as [25]

$\log p(\mathbf{y} \mid \theta) = -\frac{1}{2} \mathbf{y}^\top K^{-1} \mathbf{y} - \frac{1}{2} \log |K| - \frac{n}{2} \log 2\pi$   (8)

$\frac{\partial}{\partial \theta_i} \log p(\mathbf{y} \mid \theta) = \frac{1}{2} \mathbf{y}^\top K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} \mathbf{y} - \frac{1}{2} \operatorname{tr}\left(K^{-1} \frac{\partial K}{\partial \theta_i}\right)$   (9)

where $\mathbf{y}$ are the function values at the sample locations and $K$ is the kernel matrix of $k_z$ evaluated on the training data.
Optimisation of the LML was performed via multi-started gradient descent. The gradient in Eq. 9 relies on the gradient of the kernel w.r.t. each of its parameters:

$\frac{\partial k_z}{\partial \theta_h} = (1 - \lambda) \frac{\partial k_h}{\partial \theta_h} + \lambda \frac{\partial k_h}{\partial \theta_h} k_x$   (10)

$\frac{\partial k_z}{\partial \theta_x} = (1 - \lambda) \frac{\partial k_x}{\partial \theta_x} + \lambda k_h \frac{\partial k_x}{\partial \theta_x}$   (11)

$\frac{\partial k_z}{\partial \lambda} = k_h k_x - (k_h + k_x)$   (12)

where we used the shorthand $k_h = k_h(\mathbf{h}, \mathbf{h}')$ and $k_x = k_x(\mathbf{x}, \mathbf{x}')$, and where $\theta_h$ and $\theta_x$ denote the hyperparameters of $k_h$ and $k_x$ respectively.
Appendix D Further details for the optimisation problems
D.1 Synthetic test functions
We generated several synthetic test functions: Func2C, Func3C and an Ackley-C series, whose input spaces comprise both continuous variables and multiple categorical variables. Each categorical input in all three test functions has multiple possible values. Func2C is a test problem with two continuous inputs and two categorical inputs. The categorical inputs decide the linear combinations between three two-dimensional global optimisation benchmark functions: Beale (bea), Six-Hump Camel (cam) and Rosenbrock (ros).^6 Func3C is similar to Func2C but with three categorical inputs, which leads to more complicated linear combinations among the three functions. We also generated a series of synthetic functions, Ackley-cC, by converting $c$ dimensions of the Ackley function into categorical inputs, each with multiple values, keeping the remaining input continuous. Lastly, we generated a variant of Ackley5C, named Ackley5C-5, which divides dimensions of the 6D Ackley function into a smaller number of categories each. The value ranges for both the continuous and categorical inputs of these functions are shown in Table 4.

^6 The analytic forms of these functions are available at https://www.sfu.ca/~ssurjano/optimization.html
Table 4: Value ranges of the continuous and categorical inputs for each synthetic test function (columns: Function, Inputs, Input values).
D.2 Real-world problems
Problem      Inputs                     Input values
SVM-Boston   kernel type                linear, poly, RBF, sigmoid
             kernel coefficient         scale, auto
             shrinking                  shrinking on, shrinking off
             penalty parameter          (continuous)
             tolerance for stopping     (continuous)
NN-Yacht     activation type            ReLU, tanh, sigmoid
             optimiser type             SGD, Adam, RMSprop, AdaGrad
             suggested dropout value    (categorical)
             learning rate              (continuous)
             number of neurons          (continuous)
             aleatoric variance         (continuous)
XG-MNIST     booster type               gbtree, dart
             grow policy                depthwise, loss
             training objective         softmax, softprob
             learning rate              (continuous)
             maximum depth              (continuous)
             minimum split loss         (continuous)
             subsample                  (continuous)
             regularisation             (continuous)
We defined three real-world tasks of tuning the hyperparameters for ML algorithms: SVM-Boston, NN-Yacht and XG-MNIST.
SVM-Boston outputs the negative mean squared error of a support vector machine (SVM) regressor on the test set of the Boston housing dataset. We use the Nu Support Vector regression algorithm from the scikit-learn package [23], holding out a portion of the data for testing.
NN-Yacht returns the negative log likelihood of a one-hidden-layer neural network regressor on the test set of the Yacht Hydrodynamics dataset. We follow the MC Dropout implementation and the random train/test split on the dataset proposed in [13].^7 The network is trained with a mean squared error objective for 20 epochs, and we run multiple stochastic forward passes at test time to approximate the predictive mean and variance.

^7 Code and data are available at https://github.com/yaringal/DropoutUncertaintyExps
Finally, XG-MNIST returns the classification accuracy of an XGBoost model [8] on the test set of the MNIST dataset. We use the xgboost package and adopt a stratified train/test split.
The hyperparameters over which we optimise for each of the above ML tasks are summarised in Table 5. Note that Table 5 presents the unnormalised ranges for the continuous inputs, but all continuous inputs are normalised to $[0, 1]$ for optimisation in our experiments. All remaining hyperparameters are set to their default values.
Appendix E Additional experimental results
E.1 Additional results for synthetic functions
We evaluated the optimisation performance of our proposed CoCaBO methods and the other existing methods on Func2C, Func3C and Ackley5C in the sequential setting ($b = 1$). The mean and standard error over repeated random runs are presented in Figure 6. It is evident that the CoCaBO methods perform competitively with, and often better than, all other approaches on these synthetic problems.
We also evaluated all the methods on a variant of Ackley5C in which each categorical input can choose between only a few discrete values. The results in both the sequential and batch settings are shown in Figure 7. Again, our CoCaBO methods perform strongly against the benchmark methods.
E.2 Sequential results for real-world problems
We also evaluated the optimisation performance of all methods on SVM-Boston and XG-MNIST in the sequential setting ($b = 1$). The mean and standard error over repeated random runs are presented in Figure 8. The CoCaBO methods perform competitively with, and often better than, all other approaches on these real-world problems.