Abstract
We propose a novel, theoretically grounded acquisition function for batch Bayesian optimization, informed by insights from distributionally ambiguous optimization. Our acquisition function is a lower bound on the well-known Expected Improvement function, which requires a multi-dimensional Gaussian expectation over a piecewise affine function, and is computed by evaluating instead the best-case expectation over all probability distributions consistent with the same mean and variance as the original Gaussian distribution. Unlike alternative approaches, including Expected Improvement, our proposed acquisition function avoids multi-dimensional integrations entirely and can be computed exactly as the solution of a convex optimization problem in the form of a tractable semidefinite program (SDP). Moreover, we prove that the solution of this SDP also yields exact numerical derivatives, which enable efficient optimization of the acquisition function. Finally, it efficiently handles marginalized posteriors with respect to the Gaussian Process’ hyperparameters. We demonstrate superior performance to heuristic alternatives and approximations of the intractable expected improvement, justifying this performance difference based on simple examples that break the assumptions of state-of-the-art methods.
1 Introduction
When dealing with numerical optimization problems in engineering applications, one is often faced with the optimization of an expensive process that depends on a number of tuning parameters. Examples include the outcome of a biological experiment, the training of large-scale machine learning algorithms, or the outcome of exploratory drilling. We are concerned with problem instances wherein there is the capacity to perform experiments in parallel and, if needed, to repeatedly select further batches of evaluation points as part of some sequential decision-making process. Given the cost of the process, we wish to select the parameters at each stage of evaluations carefully, so as to optimize the process using as few experimental evaluations as possible.
2 Bayesian optimization
It is common to assume a surrogate model for the outcome of the process to be optimized. This model, which is built based on both prior assumptions and past function evaluations, is used to determine a collection of input points for the next set of evaluations. Bayesian Optimization provides an elegant surrogate model approach and has been shown to outperform other state-of-the-art algorithms on a number of challenging benchmark functions (Jones, 2001). It models the unknown function with a Gaussian Process (GP) (Rasmussen and Williams, 2005), a probabilistic function approximator which can incorporate prior knowledge such as smoothness, trends, etc.
A comprehensive introduction to GPs can be found in (Rasmussen and Williams, 2005). In short, modelling a function f with a GP amounts to modelling it as a realization of a stochastic process. In particular, we assume that the outcomes of function evaluations are normally distributed random variables with known prior mean function μ(·) and prior covariance function k(·,·). Prior knowledge about f, such as smoothness and trends, can be incorporated through judicious choice of the covariance function k, while the mean function is commonly assumed to be zero. A training dataset D = {(x_i, y_i) : i = 1, ..., n} of past function evaluations, with y_i the observed outcome at x_i, is then used to calculate the posterior of f.
The GP regression equations (Rasmussen and Williams, 2005) give this posterior on a batch of q test locations X = (x_1, ..., x_q) as

y | D ~ N(μ(X), Σ(X)),   (1)

with

μ(X) = K(X, D) K(D, D)^{-1} y_D,   (2)
Σ(X) = K(X, X) - K(X, D) K(D, D)^{-1} K(D, X),   (3)

where K(X, D) is the matrix containing the prior covariances, generated by k(·,·), between every pair of test and training points (and similarly for K(D, D), K(X, X) and K(D, X)), and y_D is the vector of training outcomes. The posterior mean μ(X) and covariance Σ(X) also depend on the dataset D, but we do not indicate this dependence explicitly in order to simplify the notation. Likewise, the posterior y is a normally distributed random variable whose mean and covariance depend on X, but we do not make this explicit.
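The posterior equations above can be illustrated with a short NumPy sketch. The squared exponential kernel and all names here are illustrative assumptions rather than the paper's implementation (which builds on GPy/GPflow):

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Prior covariance k(a, b) between all pairs of rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X_train, y_train, X_test, noise=1e-8):
    # Posterior mean and covariance of a zero-mean GP at the test batch,
    # i.e. equations (1)-(3), with a small noise term for numerical stability.
    K_dd = sq_exp_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_xd = sq_exp_kernel(X_test, X_train)
    K_xx = sq_exp_kernel(X_test, X_test)
    mu = K_xd @ np.linalg.solve(K_dd, y_train)
    Sigma = K_xx - K_xd @ np.linalg.solve(K_dd, K_xd.T)
    return mu, Sigma
```

With small observation noise the posterior mean interpolates the training data, which provides a quick sanity check of the implementation.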
Given the surrogate model, we wish to identify some selection criterion for choosing the next batch of points to be evaluated. Such a selection criterion is known as an acquisition function in the terminology of Bayesian Optimization. Ideally, such an acquisition function would take into account the number of remaining evaluations that we can afford, e.g. by computing a solution via dynamic programming to construct an optimal sequence of policies for future batch selections. However, a probabilistic treatment of such a criterion is computationally intractable, involving multiple nested minimization-marginalization steps (González et al., 2016).
To avoid this computational complexity, myopic acquisition functions that only consider the one-step return are typically used instead. For example, one could choose to minimize the one-step Expected Improvement (described more fully in §3) over the best evaluation observed so far, or maximize the probability of having an improvement in the next batch over the best evaluation. Other criteria use ideas from the bandit (Desautels et al., 2014) and information theory (Shah and Ghahramani, 2015) literature. In other words, the intractability of the multi-step lookahead problem has spurred the introduction of a wide variety of myopic heuristics for batch selection.
3 Expected improvement
We will focus on the (one-step) Expected Improvement criterion, which is a standard choice and has been shown to achieve good results in practice (Snoek et al., 2012). In order to give a formal description we first require some definitions related to the optimization procedure of the original process. At each step of the optimization, define y_D as the vector of past function values evaluated at the points of the training dataset D, and X = (x_1, ..., x_q) as a candidate set of q points for the next batch of evaluations. Then the Expected Improvement acquisition function is defined as
α_EI(X) := E_{y ~ N(μ(X), Σ(X))}[min(0, y_1 - y_min, ..., y_q - y_min)],   (4)
where y_min is the elementwise minimum of y_D, i.e. the minimum value of the function achieved so far by any known function input. Selection of a batch of points to be evaluated with optimal expected improvement then amounts to finding some X* ∈ argmin_X α_EI(X).
Unfortunately, direct evaluation of the acquisition function α_EI requires the q-dimensional integration of a piecewise affine function; this is potentially a computationally expensive operation. This is particularly problematic for gradient-based optimization methods, wherein α_EI may be evaluated many times when searching for a minimizer. Regardless of the optimization method used, such a minimizer must also be computed again at every step of the original optimization process, i.e. every time a new batch of points is selected for evaluation. A tractable acquisition function should therefore be used. In contrast to (4), the acquisition function we introduce in section 4 avoids expensive integrations and can be calculated efficiently with standard software tools.
Despite these issues, Chevalier and Ginsbourger (2013) presented an efficient way of approximating α_EI and its derivative (Marmin et al., 2015) by decomposing it into a sum of q-dimensional Gaussian cumulative distributions, which can be calculated efficiently using the seminal work of Genz and Bretz (2009). There are two issues with this approach: first, the number of calls to the q-dimensional Gaussian cumulative distribution grows quadratically with the batch size q; second, there are no guarantees about the accuracy of the approximation or its gradient. Indeed, approximations of the multi-point expected improvement, as calculated with the R package DiceOptim (Roustant et al., 2012), can be shown to be arbitrarily wrong in trivial low-dimensional examples (see Figure 3). To avoid these issues, Gonzalez et al. (2016) and Ginsbourger et al. (2009) rely on heuristics to derive a multi-point criterion. Both methods choose the batch points in a greedy, sequential way, which prevents them from exploiting the interactions between the batch points in a probabilistic manner.
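As a reference point for the discussion above, the intractable expectation in (4) can always be estimated by brute-force Monte Carlo over the GP posterior. The sketch below is illustrative (hypothetical names), and assumes the sign convention in which the expected improvement is a nonpositive quantity to be minimized:

```python
import numpy as np

def multipoint_ei_mc(mu, Sigma, y_min, n_samples=200_000, seed=0):
    # Monte Carlo estimate of E[min(0, y_1 - y_min, ..., y_q - y_min)]
    # for y ~ N(mu, Sigma): sample the posterior over the batch and average.
    rng = np.random.default_rng(seed)
    y = rng.multivariate_normal(mu, Sigma, size=n_samples)
    return np.minimum(y.min(axis=1) - y_min, 0.0).mean()
```

For a batch of size one this can be checked against the classical closed form of the Gaussian expected improvement, which makes it a useful sanity check, but the sampling cost and noise make it unsuitable inside a gradient-based optimizer.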
4 Distributionally ambiguous optimization for Bayesian optimization
We now proceed to the main contribution of the paper, which draws ideas from the Distributionally Ambiguous Optimization community to derive a novel, tractable acquisition function which lower bounds the expectation in (4). Our acquisition function:

- is theoretically grounded;
- is numerically reliable and consistent, unlike Expected Improvement-based alternatives (see §6);
- is fast and scales well with the batch size;
- provides gradients as a direct by-product of the function evaluation; and
- handles marginalized posteriors with respect to the GP’s hyperparameters efficiently.
In particular, we use the posterior result (1)–(3) derived from the GP to determine the mean μ(X) and covariance Σ(X) of y given a candidate batch selection X, but we thereafter ignore the Gaussian assumption and consider only that y has a distribution embedded within a family of distributions that share the mean and covariance calculated by (2) and (3). In other words, we define 𝒫(μ, Σ) as the set of all probability distributions on R^q with mean μ and covariance Σ.
We denote the set 𝒫(μ(X), Σ(X)) simply as 𝒫 where the context is clear. Note in particular that N(μ, Σ) ∈ 𝒫(μ, Σ) for any choice of mean μ or covariance Σ.
One can then construct upper and lower bounds for the Expected Improvement by maximizing or minimizing over the set 𝒫, i.e. by writing

inf_{P ∈ 𝒫} E_P[g(y)] ≤ E_{y ~ N(μ(X), Σ(X))}[g(y)] ≤ sup_{P ∈ 𝒫} E_P[g(y)],   (5)

where the random vector y and the function g are chosen such that α_EI(X) = E[g(y)], i.e. y ~ N(μ(X), Σ(X)) and

g(y) := min(0, y_1 - y_min, ..., y_q - y_min).   (6)
Observe that the middle term in (5) is equivalent to the expectation in (4).
Perhaps surprisingly, both of the bounds in (5) are computationally tractable even though they seemingly require optimization over the infinite-dimensional (but convex) set of distributions 𝒫. In either case, these bounds can be computed exactly via transformation of the problem into a tractable, convex semidefinite optimization problem using distributionally ambiguous optimization techniques (Zymler et al., 2013).
This computational tractability comes at the cost of inexactness in the bounds (5), which is a consequence of minimizing (or maximizing) over a set of distributions containing the Gaussian distribution as just one of its members. We show in §6 that this inexactness is of limited consequence in practice. Nevertheless, there remains significant scope for tightening the bounds in (5) by imposing additional convex constraints on the set 𝒫, e.g. by restricting 𝒫 to symmetric or unimodal distributions (Parys et al., 2015). Most of the results in this work would still apply, mutatis mutandis, if such structural constraints were included.
In contrast, the distributional ambiguity is useful for integrating out the uncertainty in the GP’s hyperparameters efficiently. To see this, first define the second-order moment matrix of the posterior as

Ω(X) := E[[y; 1][y; 1]^T] = [[Σ(X) + μ(X)μ(X)^T, μ(X)], [μ(X)^T, 1]],   (7)
which we will occasionally write as Ω(X) to highlight the dependency of the second-order moment matrix on X. Given m samples of the hyperparameters, resulting in second-order moment matrices Ω_1(X), ..., Ω_m(X), we can estimate the second moment matrix of the marginalized, non-Gaussian posterior as the average Ω̄(X) = (1/m) Σ_{j=1}^m Ω_j(X).
Due to the distributional ambiguity of our approach, both bounds in (5) can be calculated directly from Ω̄(X), hence avoiding multiple calls to the acquisition function.
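Concretely, the moment matrix of (7) and its average over hyperparameter samples can be formed with a few lines of NumPy. This is a sketch with hypothetical names, not the paper's implementation:

```python
import numpy as np

def moment_matrix(mu, Sigma):
    # Second-order moment matrix of (7): E[[y; 1][y; 1]^T].
    q = len(mu)
    Omega = np.empty((q + 1, q + 1))
    Omega[:q, :q] = Sigma + np.outer(mu, mu)
    Omega[:q, q] = mu
    Omega[q, :q] = mu
    Omega[q, q] = 1.0
    return Omega

def marginalized_moment_matrix(posteriors):
    # Average the moment matrices induced by m hyperparameter samples,
    # each contributing its own posterior (mu_j, Sigma_j) for the batch.
    return sum(moment_matrix(mu, Sigma) for mu, Sigma in posteriors) / len(posteriors)
```

The acquisition function is then evaluated once on the averaged matrix, rather than once per hyperparameter sample.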
We will focus on the lower bound in (5), hence adopting an optimistic modelling approach.
The following result states that the lower (i.e. optimistic) bound in (5) can be computed via the solution of a convex optimization problem whose objective function is linear in Ω(X):
The optimal value of the semi-infinite optimization problem

inf_{P ∈ 𝒫(μ(X), Σ(X))} E_P[g(y)]

coincides with the optimal value of the following semidefinite program:

sup_M   ⟨Ω(X), M⟩   ()
s.t.    C_i - M ⪰ 0,  i = 0, ..., q,

where M ∈ S^{q+1} is the decision variable and

C_0 := 0,   C_i := (e_i e_{q+1}^T + e_{q+1} e_i^T)/2 - y_min e_{q+1} e_{q+1}^T,  i = 1, ..., q,   (8)

are auxiliary matrices defined using y_min and the standard basis vectors e_i in R^{q+1}.
Proof.
See Appendix A. ∎
Problem () is a semidefinite program (SDP). SDPs are convex optimization problems with a linear objective and convex conic constraints, i.e. constraints requiring that symmetric matrices (elements of S^n) be positive semidefinite or positive definite. Hence it can be solved to global optimality with standard software tools (Sturm, 1999; O’Donoghue et al., 2016a). We therefore propose the computationally tractable acquisition function α(X), defined as the optimal value of problem () with moment matrix Ω(X),
which is an optimistic variant of the Expected Improvement function in (4).
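To see the optimistic flavor concretely, consider the scalar special case of a single test point, where the best-case expectation over all distributions with a given mean and variance has a classical closed form (often attributed to Scarf). The sketch below compares it with the Gaussian expected improvement; this one-dimensional shortcut is an illustration only, as the paper's acquisition function solves the SDP () for general batch sizes:

```python
import math

def gaussian_ei(mu, sigma, y_min):
    # E[(y_min - y)^+] for y ~ N(mu, sigma^2): the classical closed form.
    z = (y_min - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (y_min - mu) * Phi + sigma * phi

def optimistic_ei(mu, sigma, y_min):
    # Best-case E[(y_min - y)^+] over all distributions sharing the same
    # mean and variance; it always dominates the Gaussian value, since the
    # Gaussian is just one member of the ambiguity set.
    d = y_min - mu
    return 0.5 * (d + math.hypot(d, sigma))
```

For example, with mu = 0, sigma = 1 and y_min = 0 the Gaussian value is 1/sqrt(2*pi) ≈ 0.399, while the optimistic bound is exactly 0.5.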
Although the value of α(X) for any fixed X is computable via the solution of an SDP, the complex dependence (2)–(3) defining the mapping X ↦ Ω(X) means that α is still nonconvex in X. This is unfortunate, since we ultimately wish to minimize α in order to identify the next batch of points to be evaluated experimentally.
However, it is still possible to compute a local solution via local optimization if an appropriate gradient can be computed. The next result establishes that this is always possible provided that Σ(X) is positive definite:

Theorem. α is differentiable wherever Σ(X) ≻ 0, with ∂α/∂Ω = M*, where M* is the unique optimal solution of () at Ω(X).
The preceding result shows that ∂α/∂Ω is produced as a by-product of the evaluation of α, since it is simply the unique optimizer M* of (). Hence we can evaluate ∇_X α using the optimal solution M* and an application of the chain rule, i.e.

∂α/∂X_ij = ⟨M*, ∂Ω(X)/∂X_ij⟩.
Note that the second term in the rightmost inner product above depends on the particular choice of covariance function. This computation is standard for many such choices, and many standard software tools, including GPy and GPflow, provide the means to compute it efficiently.
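The chain-rule step can be sanity-checked numerically: once the optimizer is held fixed (as the envelope-type argument of the theorem justifies locally), differentiating the linear objective reduces to inner products with the derivative of the moment matrix. The toy Ω below is a stand-in for the GP-induced moment matrix, and all names are illustrative:

```python
import numpy as np

def omega(theta):
    # Toy smooth moment-matrix-valued map of a scalar parameter,
    # standing in for theta -> Omega(X(theta)).
    return np.array([[theta ** 2 + 1.0, theta],
                     [theta, 1.0]])

def domega(theta):
    # Analytic derivative d Omega / d theta of the toy map above.
    return np.array([[2.0 * theta, 1.0],
                     [1.0, 0.0]])

M_star = np.array([[-1.0, 0.3],   # a fixed symmetric "optimizer", held
                   [0.3, 0.2]])   # constant as in the theorem's argument

def alpha(theta):
    # Linear objective <Omega(theta), M*>.
    return float(np.sum(omega(theta) * M_star))

def grad_alpha(theta):
    # Chain rule: d alpha / d theta = <M*, d Omega / d theta>.
    return float(np.sum(domega(theta) * M_star))
```

A central finite difference agrees with grad_alpha to first order, which is the property exploited when optimizing the acquisition function with a quasi-Newton method.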
We are now in a position to present an algorithm summarizing our proposed Bayesian Optimisation method. In the sequel we present timing and numerical results to demonstrate its validity.
5 Timing
Our acquisition function is evaluated via the solution of a semidefinite program, and as such it benefits from the considerable advances of the convex optimization field. A variety of standard tools exist for solving such problems accurately, such as the commercial second-order solver MOSEK (MOSEK). However, although second-order solvers require few iterations and give very accurate results, they are not applicable to very large problems (Boyd and Vandenberghe, 2004). On the other hand, first-order solvers, such as SCS (O’Donoghue et al., 2016b) or CDCS (Zheng et al., 2017), scale well with the problem size, are amenable to parallelism, and implementations on graph-based GPU acceleration frameworks have recently been suggested (Wytock et al., 2016). They also allow for solver warm-starting between acquisition function evaluations, which results in significant speedup in our case, since we repeatedly evaluate the acquisition function at nearby points to perform gradient-based optimization. In total, for the challenging optimization of the 5d Alpine1 function, the calculation of OEI and its derivative is (on average) 10, 100 and 1000 times faster than DiceOptim’s QEI derivative for batch sizes of 2, 7, and 16 respectively, when using SCS as a solver with warm starting.
6 Empirical analysis
Key     Description
OEI     Optimistic Expected Improvement (this work)
QEI     Multi-point Expected Improvement (Chevalier and Ginsbourger, 2013)
QEICL   Multi-point Expected Improvement with the Constant Liar heuristic (Ginsbourger et al., 2010)
LPEI    Expected Improvement with Local Penalization (Gonzalez et al., 2016)
BLCB    Batch Lower Confidence Bound (Desautels et al., 2014)
In this section we demonstrate the effectiveness of our acquisition function against a number of state-of-the-art alternatives. The acquisition functions we consider are listed in Table 1. In the implementation of our own acquisition function, OEI, we produce both acquisition function values and gradients by solving the semidefinite program () using the solver MOSEK (MOSEK), accessed through the Python-based convex optimization package CVXPY (Diamond and Boyd, 2016).
We show that OEI achieves better performance than the alternatives and highlight simple failure cases exhibited by competing methods. In making the following comparisons, extra care must be taken with the experimental setup. This is because Bayesian Optimisation is a multifaceted procedure that depends on a collection of disparate elements (e.g. the kernel/mean function choice, normalization of data, acquisition function, and optimization of the acquisition function), each of which can have a considerable effect on the resulting performance (Snoek et al., 2012). For this reason we test the different algorithms on a unified testing framework, based on GPflow, which will be released upon acceptance, and in which all the elements are the same unless stated otherwise.
We first demonstrate that the competing Expected Improvement based algorithms produce clearly suboptimal choices in a simple 1-dimensional example. We consider a 1d Gaussian Process on a bounded interval with a squared exponential kernel (Rasmussen and Williams, 2005) with fixed lengthscale, variance and noise, and a quadratic mean function. An example posterior given 10 observations is depicted in Figure 1. Given the GP and the 10 observations, we depict the optimal 3-point batch chosen by minimizing each acquisition function. Note that in this case we assume perfect modelling assumptions: the GP is completely known and representative of the actual process. We can observe in Figure 1 that the OEI choice is very close to the one proposed by QEI while being slightly more explorative, as OEI allows for the possibility of more exotic distributions than the Gaussian.
In contrast, the LPEI heuristic samples three times at almost the same point. This can be explained as follows: LPEI is based on a Lipschitz criterion to penalize areas around previous choices. However, the Lipschitz constant for this function is dominated by the end points of the interval (due to the quadratic trend), which is clearly not suitable for the area around the minimizer (around zero), where the local Lipschitz constant is approximately zero. On the other hand, QEICL favors suboptimal regions. This is because QEICL postulates outputs equal to the mean value of the observations, which significantly alters the GP posterior.
Testing the algorithms on 200 different posteriors, generated by creating sets of 10 observations drawn from the previously defined GP, suggests OEI as the clear winner. For each of the 200 posteriors, each algorithm chooses a batch, the performance of which is evaluated by sampling the multi-point expected improvement (4). The averaged results are depicted in Figure 2. For a batch size of 1 all of the algorithms perform the same, except for OEI, which performs slightly worse due to the relaxation of Gaussianity. For batch sizes 2–3, QEI is the best strategy, with OEI a very close second. For batch sizes 4–5, OEI performs significantly better. The deterioration of the performance of QEI at batch sizes 4 and 5 is explained in Figure 3. In particular, beyond batch size 3, the calculation of QEI via the R package DiceOptim can become arbitrarily bad, leading to poor choices. A bug report will be submitted to the authors of DiceOptim after the end of the double-blind review process. Figure 3 also explains the very good performance of OEI. Although always different from the sampled estimate, it is reliable and closely approximates the actual expected improvement, in the sense that their optimizers and level sets are in close agreement.
We next evaluate the performance of OEI in minimizing synthetic benchmark functions. We consider a mixture-of-cosines function, the Branin-Hoo function, the Six-Hump Camel function, and the 6-dimensional Hartman function. We compare the performance of OEI against QEI, LPEI and BLCB, as well as random uniform sampling. The initial dataset consists of 5 random points for the 2d functions and 20 random points for the 6d function, while the batch size is equal to 5 for all the experiments.
A squared exponential kernel with automatic relevance determination is used for the GP modelling (Rasmussen and Williams, 2005). Since none of the competing algorithms can efficiently handle uncertainty in the hyperparameters, point estimates acquired by maximum likelihood are used for them. In contrast, the distributional ambiguity of OEI allows efficient sampling of the GP posterior to integrate out the uncertainty in the hyperparameters, as explained in §4. We choose a Gamma prior for the lengthscales and a Gaussian prior for the kernel’s variance. We then draw samples from the posterior of the hyperparameters given the data via Hamiltonian Monte Carlo. For the purpose of generality, the input domain of every function is scaled to the unit hypercube, while the observations are rescaled at each iteration. The same scalings are applied to QEI, LPEI and BLCB for reasons of consistency. The acquisition functions are optimized with the quasi-Newton L-BFGS-B algorithm (Fletcher, 1987) with 20 random restarts.
The results are presented in Figure 4. All of the methods except BLCB managed to identify the global minimum of the three two-dimensional functions (Cosines, Branin-Hoo, Six-Hump Camel). On these functions, LPEI, QEI and OEI perform similarly. However, LPEI fails to run on the Branin-Hoo function for most of the tested seeds due to an error in the computation of the gradients of the acquisition function. A bug report will be submitted to the authors of GPyOpt (the official package for LPEI) after the end of the double-blind review process. Most importantly, in the challenging case of the 6-dimensional Hartman function our algorithm achieves the best performance, with BLCB also demonstrating good performance in the longer run. Note that a log transformation is applied to the Hartman function, following Jones et al. (1998).
7 Conclusions
We have introduced a new acquisition function that is a tractable, probabilistic relaxation of the multi-point Expected Improvement, drawing ideas from the Distributionally Ambiguous Optimization community. Our acquisition function can be computed exactly via the solution of a semidefinite program, with its gradient available as a direct by-product of this solution. In contrast to alternative approaches, our proposed acquisition function scales well with the batch size, does not rely on heuristics (which we demonstrate can fail even in simple cases) and directly considers the correlations between batch points, achieving performance close to the actual multi-point Expected Improvement while remaining computationally tractable. Finally, its distributionally ambiguous approach allows for an efficient handling of marginalized posteriors with respect to the GP’s hyperparameters. Future work includes exploiting related results from the optimization community: sensitivity analysis for semidefinite programming (Freund and Jarre, 2004) can provide the Hessian of the acquisition function, allowing a Newton-based method to be used in the challenging problem of minimizing the high-dimensional acquisition function. Moreover, all of the work presented here can be applied to Student-t processes (Shah et al., 2014), the results of which will be explored in a subsequent publication.
Acknowledgments
This work was supported by the EPSRC AIMS CDT grant EP/L015987/1 and Schlumberger.
Appendix A Value of the Expected Improvement Lower Bound
In this section we provide a proof of the first of our main results, Theorem 4, which establishes that one can compute the value of our optimistic lower bound function

α(X) = inf_{P ∈ 𝒫(μ(X), Σ(X))} E_P[g(y)]   (9)
via solution of a convex optimization problem in the form of a semidefinite program.
Proof of Theorem 4:
Recall that the set 𝒫(μ, Σ) is the set of all distributions with mean μ and covariance Σ. Following the approach of Zymler et al. (2013), we first remodel problem (9) as:
inf_ν   ∫ g(y) ν(dy)   (10)
s.t.    ∫ [y; 1][y; 1]^T ν(dy) = Ω(X),
        ν ∈ M_+,

where M_+ represents the cone of nonnegative Borel measures on R^q. The optimization problem (10) is a semi-infinite linear program, with an infinite-dimensional decision variable ν and a finite collection of linear equalities in the form of moment constraints.
As shown by Zymler et al. (2013), the dual of problem (10) has instead a finite-dimensional set of decision variables and an infinite collection of constraints, and can be written as

sup_M   ⟨Ω(X), M⟩   (11)
s.t.    [y; 1]^T M [y; 1] ≤ g(y)  for all y ∈ R^q,

with M ∈ S^{q+1} the decision variable and Ω(X) the second-order moment matrix of y (see (7)). Strong duality holds between problems (10) and (11), i.e. there is zero duality gap and their optimal values coincide.
The dual decision variable M in (11) forms a matrix of Lagrange multipliers for the constraints in (10) that is block decomposable as

M = [[Λ, λ], [λ^T, λ_0]],

where Λ ∈ S^q are the multipliers for the second moment constraint, λ ∈ R^q the multipliers for the mean value constraint, and λ_0 ∈ R a scalar multiplier for the constraint that ν should integrate to 1, i.e. that ν should be a probability measure.
Appendix B Gradient of the Expected Improvement Lower Bound
In this section we provide a proof of our second main result, Theorem 4, which shows that the gradient of our lower bound function (5) with respect to the moment matrix Ω coincides with the unique optimal solution M* of the semidefinite program ().
Before proving Theorem 4 we require two ancillary results. The first of these establishes that any feasible point of the optimization problem () has a strictly negative definite upper-left block.

Lemma. For any feasible M of (), the upper-left q×q block of M is negative definite.
Proof.
Let

M = [[A, b], [b^T, c]],

where A ∈ S^q, b ∈ R^q and c ∈ R.

Using the definitions of C_i for i = 0, ..., q from (8), the semidefinite constraint in the optimization problem () can be expressed as

[[-A, -b], [-b^T, -c]] ⪰ 0,   (13)
[[-A, e_i/2 - b], [(e_i/2 - b)^T, -y_min - c]] ⪰ 0,  i = 1, ..., q.   (14)

We can easily see that A ⪯ 0 by substituting vectors of the form (z, 0) into (14). It remains to show that this matrix is actually sign definite, which amounts to showing that it is full rank.

Assume the contrary, so that there exists some nonzero z satisfying Az = 0. Then, applying (14) to vectors of the form (z, ε) gives 2ε(e_i/2 - b)^T z + ε²(-y_min - c) ≥ 0. When (e_i/2 - b)^T z ≠ 0 for some i, a sufficiently small ε of opposite sign can be chosen such that this quantity is negative, and the constraint (14) is violated.

On the other hand, when (e_i/2 - b)^T z = 0 for every i, we can conclude that e_i^T z = 2b^T z for all i, which means that z = γ1, where γ := 2b^T z ≠ 0 and 1 denotes the vector of ones. In that case (13) is violated for a sufficiently small positive ε, choosing without loss of generality the sign of z such that γ > 0. Hence A is full rank by contradiction, and A ≺ 0. ∎
Our second ancillary result considers the gradient of the optimal value of () when the moment matrix is varied linearly along some direction.

Lemma. Given any direction Δ ∈ S^{q+1} and any moment matrix Ω, define the scalar function h as

h(t) := sup_M { ⟨Ω + tΔ, M⟩ : C_i - M ⪰ 0, i = 0, ..., q }.

Then h is differentiable at t = 0 with h'(0) = ⟨Δ, M*⟩, where M* is the optimal solution of () at Ω.
Proof.
Define the set T as

T := { t ∈ R : problem () with moment matrix Ω + tΔ has a bounded optimal value },

i.e. the set of all t for which problem () has a bounded solution given the moment matrix Ω + tΔ. In order to prove the result it is then sufficient to show:

(i) 0 lies in the interior of T, and
(ii) the optimal solution of () at Ω is unique.

The remainder of the proof then follows from (Goldfarb and Scheinberg, 1999, Lemma 3.3), wherein it is shown that (i) implies that M* is a subgradient of h at 0, and subsequent remarks in Goldfarb and Scheinberg (1999) establish that uniqueness of the solution ensures differentiability.
We will show that both of the conditions (i) and (ii) above are satisfied. The proof relies on the Lagrange dual of problem (), which we can write as

inf_{Y_0, ..., Y_q}   Σ_{i=0}^q ⟨C_i, Y_i⟩   ()
s.t.    Σ_{i=0}^q Y_i = Ω,
        Y_i ⪰ 0,  i = 0, ..., q.
(i): Proof that 0 ∈ int T:

It is well-known that if both of the primal and dual problems () and () are strictly feasible then their optimal values coincide, i.e. Slater’s condition holds and we obtain strong duality; see (Boyd and Vandenberghe, 2004, Section 5.2.3) and (Ramana et al., 1997).

For () it is obvious that one can construct a strictly feasible point, e.g. M = -τI for sufficiently large τ > 0. For (), Y_i = (Ω + tΔ)/(q + 1) defines a strictly feasible point for any t such that Ω + tΔ ≻ 0. Hence () is solvable for any t with |t| sufficiently small. As a result, 0 ∈ int T.
(ii): Proof that the solution to () at is unique:
We can prove uniqueness by examining the Karush-Kuhn-Tucker conditions for a pair of primal and dual solutions M*, (Y_0*, ..., Y_q*):

∇_M L(M*, Y*) = 0,   (15)
C_i - M* ⪰ 0,  Y_i* ⪰ 0,  i = 0, ..., q,   (16)
Y_i*(C_i - M*) = 0,  i = 0, ..., q,   (17)
Σ_{i=0}^q Y_i* = Ω,   (18)

where L denotes the Lagrangian of (). As noted before, such a pair of solutions exists when both () and () are strictly feasible. First, we will prove that each C_i - M* has rank q and each Y_i* has rank at most one. Lemma B implies that the upper-left q×q block of C_i - M* is positive definite (recall that C_i is nonzero only in the last column or the last row), which means that rank(C_i - M*) ≥ q. Due to the complementary slackness condition (17), the span of Y_i* is orthogonal to the span of C_i - M*, and consequently rank(Y_i*) ≤ 1. However, according to (18) we have
(19) 
which results in
(20)  
with  
(21) 
where the final equality uses (17), and where and denote the kernel and image of a matrix, respectively.
Since for every i the kernel of C_i - M* is of dimension one, we can uniquely identify (up to a sign change) a vector v_i in the kernel with unit norm. Moreover, according to (20) we can represent every Y_i* as

Y_i* = ρ_i v_i v_i^T   (22)

for some ρ_i ≥ 0. The nonnegativity of ρ_i comes from the fact that Y_i* is positive semidefinite of rank at most one.
We next show that the null vectors are the same (up to a sign change) for every primal-dual solution. Assume that another pair (M̄, Ȳ_i), with M̄ ≠ M*, is also optimal, and let v̄_i denote the unit-norm null vectors of C_i - M̄. By definition we have
(23) 
However, since is feasible we have , which results in
(24) 
Since and have the same objective value we conclude that . Moreover, according to (18) and (22) we have the conditions for some . Hence
for some , which in turn implies that
Since , this proves that is the uniquely defined unit norm null vector (up to a change in sign) of . This proves that .
Given the null vectors v_i, the scalar values ρ_i can be uniquely determined using (18), as they are the coefficients of the decomposition of the rank-(q+1) matrix Ω into the rank-one matrices v_i v_i^T. Thus, given the uniqueness of v_i across every solution, (22) proves uniqueness of the Y_i*, i.e. uniqueness of the dual solution. Now we can easily prove uniqueness of the primal solution. Summing (17) gives
Proof of Theorem 4:
Given the preceding supporting results of this section, we are now in a position to prove Theorem 4.
Footnotes
 It can be shown that the upper bound in (5) is trivial to evaluate and is not of practical use.
References
 S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 Clément Chevalier and David Ginsbourger. Fast Computation of the Multi-Points Expected Improvement with Applications in Batch Selection, pages 59–69. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013. ISBN 9783642449734. doi: 10.1007/9783642449734_7.
 Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing explorationexploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, 15:4053–4103, 2014. URL http://jmlr.org/papers/v15/desautels14a.html.
 Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
 R. Fletcher. Practical Methods of Optimization; (Second Edition). WileyInterscience, New York, NY, USA, 1987. ISBN 0471915475.
 Roland W. Freund and Florian Jarre. A sensitivity result for semidefinite programs. Oper. Res. Lett., 32(2):126–132, March 2004. ISSN 01676377. doi: 10.1016/S01676377(03)000695.
 Alan Genz and Frank Bretz. Computation of Multivariate Normal and T Probabilities. Springer Publishing Company, Incorporated, 1st edition, 2009. ISBN 364201688X, 9783642016882.
 David Ginsbourger, Rodolphe Le Riche, Laurent Carraro, and Département Mi. A multi-points criterion for deterministic parallel global optimization based on gaussian processes. Journal of Global Optimization, in revision, 2009.
 David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging Is Well-Suited to Parallelize Optimization, pages 131–162. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 9783642107016. doi: 10.1007/9783642107016_6.
 D. Goldfarb and K. Scheinberg. On parametric semidefinite programming. Applied Numerical Mathematics, 29(3):361 – 377, 1999. ISSN 01689274. doi: 10.1016/S01689274(98)001020.
 Javier Gonzalez, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch bayesian optimization via local penalization. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 648–657, Cadiz, Spain, 09–11 May 2016. PMLR. URL http://proceedings.mlr.press/v51/gonzalez16a.html.
 Javier González, Michael A. Osborne, and Neil D. Lawrence. GLASSES: Relieving The Myopia Of Bayesian Optimisation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016. URL http://arxiv.org/pdf/1510.06299v1.pdf.
 GPyOpt. A bayesian optimization framework in python. http://github.com/SheffieldML/GPyOpt, 2016.
 Donald R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001. ISSN 15732916. doi: 10.1023/A:1012771025575.
 Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive blackbox functions. Journal of Global Optimization, 13(4):455–492, Dec 1998. ISSN 15732916. doi: 10.1023/A:1008306431147.
 Sébastien Marmin, Clément Chevalier, and David Ginsbourger. Differentiating the Multipoint Expected Improvement for Optimal Batch Design, pages 37–48. Springer International Publishing, Cham, 2015. ISBN 9783319279268. doi: 10.1007/9783319279268_4.
 MOSEK ApS. The MOSEK optimization toolbox for Python manual. URL http://www.mosek.com/.
 B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd. SCS: Splitting conic solver, version 1.2.6. https://github.com/cvxgrp/scs, 2016a.
 Brendan O’Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous selfdual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, feb 2016b. doi: 10.1007/s1095701608923.
 Bart PG Van Parys, Paul J Goulart, and Manfred Morari. Distributionally robust expectation inequalities for structured distributions. OptimizationOnline eprints, 2015. URL http://www.optimizationonline.org/DB_HTML/2015/05/4896.html.
 Motakuri V Ramana, Levent Tunçel, and Henry Wolkowicz. Strong duality for semidefinite programming. SIAM Journal on Optimization, 7(3):641–662, 1997.
 Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. ISBN 026218253X.
 Olivier Roustant, David Ginsbourger, Yves Deville, et al. DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by krigingbased metamodeling and optimization. Journal of Statistical Software, 51(i01), 2012.
 Amar Shah and Zoubin Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3330–3338. Curran Associates, Inc., 2015.
 Amar Shah, Andrew Gordon Wilson, and Zoubin Ghahramani. Studentt processes as alternatives to gaussian processes. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 2225, 2014, pages 877–885, 2014. URL http://jmlr.org/proceedings/papers/v33/shah14.html.
 Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
 J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11–12:625–653, 1999. Version 1.05 available from http://fewcal.kub.nl/sturm.
 Matt Wytock, Steven Diamond, Felix Heide, and Stephen Boyd. A new architecture for optimization modeling frameworks. In Proceedings of the 6th Workshop on Python for HighPerformance and Scientific Computing, PyHPC ’16, pages 36–44, Piscataway, NJ, USA, 2016. IEEE Press. ISBN 9781509052202. doi: 10.1109/PyHPC.2016.5.
 Y. Zheng, G. Fantuzzi, A. Papachristodoulou, P. Goulart, and A. Wynn. Chordal decomposition in operatorsplitting methods for sparse semidefinite programs. 2017. URL https://arxiv.org/abs/1707.05058.
 Steve Zymler, Daniel Kuhn, and Berç Rustem. Distributionally robust joint chance constraints with secondorder moment information. Mathematical Programming, 137(1):167–198, 2013. ISSN 14364646. doi: 10.1007/s1010701104947.