Correcting boundary over-exploration deficiencies in Bayesian optimization with virtual derivative sign observations

Eero Siivola, Aki Vehtari, Jarno Vanhatalo, Javier González^†. Aalto University, Dept. of Computer Science, {eero.siivola, aki.vehtari}@aalto.fi; University of Helsinki, Dept. of Math. and Stat., and Dept. of Biosciences, jarno.vanhatalo@helsinki.fi; Amazon.com, gojav@amazon.com

Abstract

Bayesian optimization (bo) is a global optimization strategy designed to find the minimum of an expensive black-box function, typically defined on a continuous subset of $\mathbb{R}^d$, by using a Gaussian process (gp) as a surrogate model for the objective. Although currently available acquisition functions address this goal with different degrees of success, an over-exploration effect of the boundary of the search space is typically observed. However, in problems like the configuration of machine learning algorithms, the function domain is conservatively large and, with high probability, the global minimum does not sit on the boundary. We propose a method to incorporate this knowledge into the search process by adding virtual derivative observations to the gp at the borders of the search space. We use the properties of gps to impose conditions on the partial derivatives of the objective. The method is applicable with any acquisition function, is easy to use, and consistently reduces the number of evaluations required to optimize the objective, irrespective of the acquisition used. We illustrate the benefits of our approach in an extensive experimental comparison.

^† Work done while Javier González was at the University of Sheffield.
1 Introduction
Global optimization is a common problem in a very broad range of applications. Formally, it is defined as finding $x_{\min} \in \mathcal{X}$ such that

(1)  $x_{\min} = \arg\min_{x \in \mathcal{X}} f(x),$
where $\mathcal{X}$ is generally considered to be a bounded set. In this work, we focus on cases in which $f$ is a black-box function whose explicit form is unknown and which is expensive to evaluate. This implies that we need to find $x_{\min}$ with a finite, typically small, number of evaluations, which transforms the original optimization problem into a sequence of decision problems.
Bayesian optimization of black-box functions using Gaussian processes (gps) as surrogate priors has become popular in recent years (see, e.g., the review by Shahriari et al., 2016b). It has proven to be an efficient method for treating the decision problems of where to evaluate $f$ using statistical inference. This is typically done by means of an acquisition function that changes after each point is collected and that balances exploration and exploitation in the search.
A common problem that has not been systematically studied in the bo literature is the tendency of most acquisition strategies to over-explore the boundary of the function domain $\mathcal{X}$. This issue is not relevant if the global minimum may lie on the border of the search space, but in most cases, including when the search space is unbounded (Shahriari et al., 2016a), this is not so. This effect has also been observed in the active learning literature (Krause and Guestrin, 2007), and it is known to appear when the search is done myopically, as is the case for most acquisition functions. Non-myopic approaches in bo (González et al., 2016; Osborne et al., 2009) can potentially deal with this problem, but they are typically very expensive to compute.
In this paper we propose a new approach to correct the boundary over-exploration effect of most acquisition functions without incurring the computational overhead of currently available non-myopic methods. We demonstrate that when the global minimum is known not to lie on the boundary of $\mathcal{X}$, this information can be embedded in the problem. In such cases, it holds that the gradient of the underlying function points towards the centre on all borders. This property can be exploited by including additional prior information about the shape of the function by means of virtual derivative observations. In other words, we use non-observed partial derivative data to support our prior assumption about the shape of the underlying function (its gradients on the boundary of the domain). As the derivative of a gp is also a gp (O'Hagan, 1992), including virtual derivative observations in gps is feasible with standard inference methods. As shown in this work, this reduces the number of required function evaluations, giving rise to a battery of more efficient bo methods.
The unwanted boundary over-exploration effect of regular bo is illustrated in Figure 1(a). A simple function consisting of two Gaussian components is optimized with standard bo and with the proposal of this work. The correction of the over-exploration effect is evident.
1.1 Related Work
Derivative observations have been used before in the bo and gp context. Koistinen et al. (2016, 2017) use Gaussian processes with derivative observations for Bayesian optimization (bo) of minimum energy path transitions of atomic rearrangements and Wu et al. (2017) use derivative observations to decrease the number of observations needed for finding the function optimum.
Derivative observations can be used to provide shape priors. To constrain a function to have a mode in a specified location, Gosling et al. (2007) add virtual observations of the first derivative being zero and the second derivative being negative. Riihimäki and Vehtari (2010) use virtual derivative observations, where only the sign of the derivative is known, to add monotonicity information. Jauch and Peña (2016) use virtual derivative observations in a similar way for convexity and quasiconvexity constraints. To handle inference for the non-Gaussian contribution of the derivative sign information, Gosling et al. (2007), Riihimäki and Vehtari (2014), and Jauch and Peña (2016) use rejection sampling, Riihimäki and Vehtari (2010) use expectation propagation (ep), and Wang and Berger (2016) use Markov chain Monte Carlo (mcmc). Unlike rejection- and projection-based shape constraints, monotonicity via virtual observations scales moderately well with the number of dimensions, as shown by Riihimäki and Vehtari (2010) and Siivola et al. (2016).
Surprisingly, over-exploration of boundaries has not been systematically studied before. A naive approach is to use a quadratic mean function to penalize the search near the boundary. However, this has strong limitations when the optimized function is multimodal or far from being quadratic, as demonstrated in Riihimäki and Vehtari (2010).
1.2 Contributions
The main contributions of this work are:

A new approach for Bayesian optimization that corrects the boundary over-exploration behaviour of most acquisitions. The method is simple to use, can be combined with any acquisition function, and always works as well as or better than the standard approach. After a review of the needed background, the method is described in Section 3.

A publicly available code framework, available at [to be disclosed after publication], that contains an efficient implementation of the methods described in this work.

An exhaustive analysis of the performance of the proposed method in a variety of scenarios that should give the reader a precise idea of (i) the loss in efficiency incurred by standard methods due to boundary over-exploration and (ii) how this issue is significantly relieved by our proposal.
In addition to the previous points, Section 5 contains a discussion of future directions and the main lessons learned in this work.
2 Background and Problem Set-Up
The main iterative steps of any bo algorithm are: (i) model the objective function with a gp prior, which is updated with the evaluations selected so far; (ii) use an acquisition function, which depends on the posterior of the objective function, to decide what the next query point should be. Next, we briefly visit both steps and detail how information from derivative observations can be naturally incorporated in the loop.
2.1 Standard gp Surrogate for Modeling of $f$
At iteration $n+1$, we assume that we have evaluated the objective function $n$ times, providing us the data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where $y_i$ is the, possibly noisy, function evaluation at input location $x_i$. To combine our previous knowledge about $f$ with the dataset $\mathcal{D}$, we use a gp to model $f$. In particular, a gp prior is directly specified on the latent function, with prior assumptions encoded in the covariance function $k(x, x')$, which specifies the covariance of two latent function values $f(x)$ and $f(x')$. A zero-mean Gaussian process prior
(2)  $\mathbf{f} \mid X \sim \mathcal{N}(\mathbf{0}, K_{\mathbf{f},\mathbf{f}})$

is chosen, where $K_{\mathbf{f},\mathbf{f}}$ is the covariance matrix between the latent values $\mathbf{f} = (f(x_1), \ldots, f(x_n))$ at the inputs $X = (x_1, \ldots, x_n)$ used for training, s.t. $[K_{\mathbf{f},\mathbf{f}}]_{ij} = k(x_i, x_j)$.
In regression, the noisy observations $\mathbf{y}$ and the latent function values $\tilde{\mathbf{f}}$ at the test inputs $\tilde{X}$ are assumed to have a Gaussian relationship. With the noise variance $\sigma^2$, the covariance $K_{\mathbf{f},\tilde{\mathbf{f}}}$ between the latent values at the training and test inputs, the covariance matrix $K_{\tilde{\mathbf{f}},\tilde{\mathbf{f}}}$ of the latent values at the test inputs, and the $n$-dimensional identity matrix $I$, the joint distribution of the observations and the latent values at the test inputs is

(3)  $\begin{bmatrix} \mathbf{y} \\ \tilde{\mathbf{f}} \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0},\ \begin{bmatrix} K_{\mathbf{f},\mathbf{f}} + \sigma^2 I & K_{\mathbf{f},\tilde{\mathbf{f}}} \\ K_{\tilde{\mathbf{f}},\mathbf{f}} & K_{\tilde{\mathbf{f}},\tilde{\mathbf{f}}} \end{bmatrix} \right).$

Using the Gaussian conditioning rule, the predictive distribution becomes $\tilde{\mathbf{f}} \mid \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with

$\boldsymbol{\mu} = K_{\tilde{\mathbf{f}},\mathbf{f}} (K_{\mathbf{f},\mathbf{f}} + \sigma^2 I)^{-1} \mathbf{y}, \qquad \Sigma = K_{\tilde{\mathbf{f}},\tilde{\mathbf{f}}} - K_{\tilde{\mathbf{f}},\mathbf{f}} (K_{\mathbf{f},\mathbf{f}} + \sigma^2 I)^{-1} K_{\mathbf{f},\tilde{\mathbf{f}}},$
so the predictive distribution of the gp can be written explicitly for any point in the domain.
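The predictive equations above can be written directly in NumPy. The following is a minimal sketch: the squared exponential kernel, its hyperparameters, and the toy sine objective are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def se_kernel(X1, X2, magnitude=1.0, lengthscale=1.0):
    """Squared exponential covariance between two sets of inputs."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return magnitude * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, Xs, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at test inputs Xs."""
    K = se_kernel(X, X) + noise * np.eye(len(X))   # K_ff + sigma^2 I
    Ks = se_kernel(X, Xs)                          # cov(train, test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha                            # K_sf (K + sigma^2 I)^-1 y
    v = np.linalg.solve(L, Ks)
    var = np.diag(se_kernel(Xs, Xs)) - (v ** 2).sum(0)
    return mean, var

# Noisy-free observations of a toy objective f(x) = sin(x)
X = np.linspace(0, 5, 8)[:, None]
y = np.sin(X).ravel()
mu, var = gp_predict(X, y, np.array([[2.5]]))
```

The Cholesky factorization avoids forming the explicit matrix inverse, which is the numerically stable way to evaluate both the predictive mean and the variance.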
2.2 Acquisition Policies
Among others, some very well established acquisition functions are available. Expected improvement (ei) maximizes the expected gain over the current best (Jones et al., 1998). The lower confidence bound (lcb) minimizes the regret over the optimization area (Srinivas et al., 2010; Brochu et al., 2010). The probability of improvement (mpi) selects the next point where the probability of improving over the current best is the highest (Kushner, 1964). Although these are some of the most widely used acquisition functions, they all suffer from the over-exploration effect described in the introduction of this work. As we will detail later, the ability of gps to handle derivative observations is key to correcting this effect. Next, we revisit the main elements needed to incorporate gradient information in gps before detailing the specifics of our proposal.
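Given the gp posterior mean and standard deviation at a candidate point, the three acquisitions above have simple closed forms. The sketch below writes them for minimization; it is an illustration of the standard formulas, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """EI for minimization: expected gain below the current best y_best."""
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def lower_confidence_bound(mu, sigma, beta=2.0):
    """LCB: smaller values of mu - beta * sigma are more promising."""
    return mu - beta * sigma

def probability_of_improvement(mu, sigma, y_best):
    """MPI: probability that a point improves on the current best."""
    return norm.cdf((y_best - mu) / sigma)

# A point whose mean equals the current best has larger EI than an
# equally uncertain point with a worse mean.
ei = expected_improvement(np.array([0.0, 0.5]), np.array([1.0, 1.0]), 0.0)
```

In each case the next query point is the maximizer of the acquisition (or minimizer, for lcb) over the search space.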
2.3 Incorporating Partial Derivative Observations in the Loop
Since differentiation is a linear operator, the partial derivative of a Gaussian process remains a Gaussian process (Solak et al., 2003; Rasmussen and Williams, 2006). Thus, using partial derivative values for prediction, and making predictions about the partial derivatives at a given point, is easy to incorporate in the model and therefore in the bo search. Since

$\operatorname{cov}\!\left(\frac{\partial f(x)}{\partial x_d}, f(x')\right) = \frac{\partial}{\partial x_d} k(x, x'), \qquad \operatorname{cov}\!\left(\frac{\partial f(x)}{\partial x_d}, \frac{\partial f(x')}{\partial x'_e}\right) = \frac{\partial^2}{\partial x_d\, \partial x'_e} k(x, x'),$

the covariance matrices in Equations (2) and (3) can be extended to include partial derivatives either as observations or as values to be predicted.
Following Riihimäki and Vehtari (2010), denote by $f'_d = \partial f(x)/\partial x_d$ the partial derivative value in dimension $d$ at $x$. Then the probability of observing the partial derivative sign $m \in \{-1, +1\}$ is modelled using a probit likelihood with a control parameter $\nu$ (see details in Riihimäki and Vehtari, 2010),

(4)  $p\!\left(m \mid f'_d\right) = \Phi\!\left(\frac{m\, f'_d}{\nu}\right),$

where $\Phi(z) = \int_{-\infty}^{z} \mathcal{N}(t \mid 0, 1)\, \mathrm{d}t$.
Let $\mathbf{m}$ be a vector of $\bar{n}$ partial derivative sign observations at the inputs $\bar{X}$, $\mathbf{d}$ be a vector of the dimensions of the partial derivatives, and $\bar{\mathbf{f}}$ be the vector of latent values at $\bar{X}$. For shorthand, let the partial derivatives of the latent values be $\bar{\mathbf{f}}'$, with components $\bar{f}'_j = \partial f(\bar{x}_j)/\partial x_{d_j}$.

Assuming conditional independence given the latent derivative values, the likelihood becomes

$p(\mathbf{m} \mid \bar{\mathbf{f}}') = \prod_{j=1}^{\bar{n}} \Phi\!\left(\frac{m_j\, \bar{f}'_j}{\nu}\right).$
With function values $\mathbf{y}$ at $X$ and partial derivative sign observations $\mathbf{m}$ at $\bar{X}$, the joint prior for $\mathbf{f}$ and $\bar{\mathbf{f}}'$ then becomes

$p(\mathbf{f}, \bar{\mathbf{f}}' \mid X, \bar{X}) = \mathcal{N}\!\left(\begin{bmatrix}\mathbf{f} \\ \bar{\mathbf{f}}'\end{bmatrix} \,\middle|\, \mathbf{0},\ \begin{bmatrix} K_{\mathbf{f},\mathbf{f}} & K_{\mathbf{f},\bar{\mathbf{f}}'} \\ K_{\bar{\mathbf{f}}',\mathbf{f}} & K_{\bar{\mathbf{f}}',\bar{\mathbf{f}}'} \end{bmatrix}\right),$

where the blocks involving $\bar{\mathbf{f}}'$ are obtained by differentiating the covariance function as above.
The joint posterior for the latent values and the latent value derivatives can be derived from Bayes' rule,

(5)  $p(\mathbf{f}, \bar{\mathbf{f}}' \mid \mathbf{y}, \mathbf{m}) = \frac{1}{Z}\, p(\mathbf{f}, \bar{\mathbf{f}}' \mid X, \bar{X})\, p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{m} \mid \bar{\mathbf{f}}'),$

with $Z$ the normalizing constant. Note that since $p(\mathbf{m} \mid \bar{\mathbf{f}}')$ is not Gaussian, the full posterior is analytically intractable and some approximation method must be used. Following Riihimäki and Vehtari (2010), we use expectation propagation (ep) for fast and accurate approximate inference.
Model comparison is often done with the energy function, the negative log marginal likelihood of the data,

$E = -\log p(\mathbf{y}, \mathbf{m} \mid X, \bar{X}) = -\log \iint p(\mathbf{f}, \bar{\mathbf{f}}' \mid X, \bar{X})\, p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{m} \mid \bar{\mathbf{f}}')\, \mathrm{d}\mathbf{f}\, \mathrm{d}\bar{\mathbf{f}}'.$

If we are interested in only some part of the model, selected points can be used to evaluate the model fit,

(6)  $E(\bar{X}, \mathbf{m}) = -\log p(\mathbf{m} \mid \mathbf{y}, X, \bar{X}).$
Depending on the choice of covariance function, gps have a varying number of parameters that must be selected. These hyperparameters also include the noise variance $\sigma^2$ of the assumed Gaussian relationship between the observations and the latent values. The topic of hyperparameter optimization is not covered here; the reader is advised to see, for example, Rasmussen and Williams (2006) and Snoek et al. (2012).
3 Bayesian Optimization With Virtual Derivative Sign Observations
In this section we illustrate how virtual derivative observations can be added to the edges of the search space. In essence, the model used is the same as the one described in Section 2.3, but observed derivative observations are replaced by 'virtual' ones at the boundaries of the domain to correct the previously described over-exploration effect.
3.1 Virtual Derivative-Based Search
To encode the prior knowledge of the non-border minimum into the optimization algorithm, we propose the following dynamic approach. Just as in regular bo with a gp prior, presented in Section 2, the objective function is given a gp prior, which is updated according to the objective function evaluations so far. The next evaluation point is the acquisition function maximum, but if it is closer than a threshold distance to the border of the search space, the point is projected to the border and a virtual derivative observation is placed at that point instead. After this virtual observation has been added, the gp prior is updated and a new proposal for the next acquisition is computed. In theory we could add virtual observations statically to the borders before starting the optimization, but to reduce the computational cost we add virtual observations lazily, only when and where they are needed. Algorithm 1 contains pseudocode for the proposed method.
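The control flow of Algorithm 1 can be sketched as follows. This is a one-dimensional sketch only: `next_point` and `add_virtual` are hypothetical stand-ins for the acquisition maximization and the ep update of the gp, which are where the actual modelling work happens.

```python
def virtual_derivative_bo(f, bounds, next_point, add_virtual,
                          n_iter=10, eps=0.05):
    """Control flow of the proposed method (a sketch of Algorithm 1).
    Each pass either adds one virtual derivative sign observation on
    the border or makes one real function evaluation."""
    lo, hi = bounds
    width = hi - lo
    X, y = [], []
    virtuals = []                          # (location, sign) pairs
    for _ in range(n_iter):
        x = next_point(X, y, virtuals)     # acquisition maximum so far
        if x - lo < eps * width:           # too close to the lower border
            add_virtual(virtuals, lo, +1)  # gradient points towards centre
            continue                       # re-propose with updated prior
        if hi - x < eps * width:           # too close to the upper border
            add_virtual(virtuals, hi, -1)
            continue
        X.append(x)
        y.append(f(x))                     # real (expensive) evaluation
    return X, y, virtuals
```

The sign convention encodes the prior belief: on the lower border the derivative is positive (the function decreases towards the centre), and on the upper border it is negative.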
Since there is no actual knowledge of the true values of the partial derivatives on the borders, the posterior distribution should behave robustly to deviations in the actual partial derivative values. By using $\nu$ very close to zero in Equation (4), the probit likelihood becomes very steep. This means that the likelihood values are close to one for all partial derivative values with $m f'_d > 0$ and close to zero for all $m f'_d < 0$. Thus the assumed numeric partial derivative values do not matter, and for instance $m = \pm 1$ can be used to express prior information on the borders. The effect of adding virtual derivative observations on the borders of a one-dimensional function is visualized in Figure 2. From the figure it can be seen that the virtual derivative observations alter both the gp prior and the acquisition functions to resemble our prior belief about the location of the minimum.
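The steepness argument can be checked numerically. This sketch assumes the probit form of Equation (4) with the standard normal CDF from SciPy; the particular derivative values are arbitrary.

```python
from scipy.stats import norm

def derivative_sign_likelihood(deriv, m, nu=1e-6):
    """Probit likelihood of Equation (4): probability of observing
    derivative sign m in {-1, +1} given derivative value `deriv`.
    As nu -> 0 this approaches a step function of sign(m * deriv)."""
    return norm.cdf(m * deriv / nu)

# Only the sign of the derivative matters when nu is tiny:
p_pos = derivative_sign_likelihood(0.3, m=+1)   # close to one
p_neg = derivative_sign_likelihood(-0.3, m=+1)  # close to zero
```

Replacing 0.3 by any other positive value leaves the likelihood essentially unchanged, which is why the exact virtual derivative magnitude never needs to be specified.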
Another parameter to be chosen is the threshold distance to the border. As acquisitions closer than the threshold are always rejected, it should not be too large. Another argument for avoiding too large values is the fading information value of the virtual observations: if the gp allows rapid changes in the latent values, virtual derivative observations affect the posterior distribution only very locally, and the added observations might alter the posterior only marginally. However, if the threshold is too small, virtual observations are added very seldom and the algorithm differs only a little from standard bo. Our experiments suggest that a small fraction of the diameter of the search space provides a good trade-off for adaptivity.
The basic time complexity of gp inference is $\mathcal{O}((n + \bar{n})^3)$, where $n$ is the number of training samples and $\bar{n}$ is the number of partial derivative observations. If the search space is restricted to be a hypercube, a partial derivative in only one direction is needed when a virtual observation is added on a face of the search space, which helps to keep the number of partial derivative observations relatively small. Naturally, various sparse approximations could be used to improve the scaling of the gp computation, but here we focus on problems where we assume that acquisitions are costly and $n$ is moderate. Thus, the time complexity of the algorithm itself is negligible compared to the time of the acquisitions.
3.2 Adaptive Search
For some practical applications, the presented algorithm might make too strict an assumption in ruling out minima at the borders. Especially for functions with multiple minima, a local minimum might lie on the border. For better local accuracy, the following modifications to Algorithm 1 can be used.
Before placing a virtual derivative observation on the border, it can be checked whether or not the existing data supports the virtual gradient sign observation to be added to the model. This can be done by comparing the energy values (Equation (6)) of virtual observations of the two possible directions, $m = +1$ and $m = -1$. Since the probit likelihood (4) is close to a step function, the energy function values can be used to approximate the posterior probability of the derivative sign. If the energy of the assumed derivative sign is smaller than that of the opposite direction, it is reasonable to add the virtual observation.
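This acceptance rule can be sketched as follows. Normalizing $\exp(-E)$ over just the two competing sign hypotheses is an illustrative approximation introduced here, not the paper's exact expression; the energies themselves would come from the ep approximation of Equation (6).

```python
import math

def sign_probability(energy_assumed, energy_opposite):
    """Approximate posterior probability of the assumed derivative sign,
    normalizing exp(-energy) over the two competing sign hypotheses."""
    e = math.exp(-energy_assumed)
    return e / (e + math.exp(-energy_opposite))

# Add the virtual observation only when the data does not contradict it,
# i.e. when the assumed sign has the smaller energy (probability > 0.5).
accept = sign_probability(1.0, 2.5) > 0.5
```

Because the probability exceeds one half exactly when the assumed sign has the smaller energy, the rule reduces to the energy comparison described above.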
Since virtual gradient observations tell only the direction of the gradient, they do not affect the uncertainty of a gp as much as regular observations do. As a result, if there are minima on the border, acquisitions might be proposed at locations where virtual observations already exist. If this happens, it is reasonable to remove the virtual observation before adding the new acquisition.
4 Experiments
In this section we introduce the four case studies performed to gain insight into the performance of the proposed method. First the shared details of the experiments are presented, and then each study and its results are described.
4.1 Experimental Set-Up
The proposed Bayesian optimization algorithm was implemented with the GPy toolbox^1. In all case studies covered below, we use a zero-mean gp prior for the regular observations and the probit likelihood, with a scaling parameter $\nu$ close to zero, for the virtual derivative observations. In addition to this, we use the squared exponential covariance function,

$k(x, x') = \sigma^2 \exp\!\left(-\sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{2\ell_d^2}\right),$

where $D$ is the input space dimension, $\sigma^2$ is the magnitude and $\ell_d$ is the characteristic length scale of dimension $d$. Its partial derivatives,

$\frac{\partial}{\partial x_d} k(x, x') = -\frac{x_d - x'_d}{\ell_d^2}\, k(x, x'), \qquad \frac{\partial^2}{\partial x_d\, \partial x'_e} k(x, x') = \left(\frac{\delta_{de}}{\ell_d^2} - \frac{(x_d - x'_d)(x_e - x'_e)}{\ell_d^2 \ell_e^2}\right) k(x, x'),$

are needed to compute the covariances involving the virtual derivative observations.

^1 Toolbox available at: https://sheffieldml.github.io/GPy/
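The squared exponential kernel and its derivative covariances can be written directly in NumPy. The sketch below uses a single isotropic length scale for brevity and is verified against finite differences; it illustrates the formulas rather than reproducing the GPy implementation.

```python
import numpy as np

def se_cov(x, xp, sigma2=1.0, ell=1.0):
    """Squared exponential covariance k(x, x') with isotropic length scale."""
    return sigma2 * np.exp(-0.5 * np.sum((x - xp) ** 2) / ell ** 2)

def se_cov_d1(x, xp, d, sigma2=1.0, ell=1.0):
    """cov(df/dx_d at x, f at x'): the first partial derivative of k."""
    return -(x[d] - xp[d]) / ell ** 2 * se_cov(x, xp, sigma2, ell)

def se_cov_d1d2(x, xp, d, e, sigma2=1.0, ell=1.0):
    """cov(df/dx_d at x, df/dx'_e at x'): the mixed second derivative of k."""
    delta = 1.0 if d == e else 0.0
    return ((delta / ell ** 2
             - (x[d] - xp[d]) * (x[e] - xp[e]) / ell ** 4)
            * se_cov(x, xp, sigma2, ell))
```

These three functions supply the blocks needed to extend the covariance matrices of Equations (2) and (3) with derivative observations.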
Initial acquisitions are generated with a full factorial design (Croarkin et al., 2013). In the experiments, points from this design are used to initialize the gp (see Section 5.3.3.3 of the citation).
Three bo algorithms are used in the case studies: the standard bo algorithm (referred to as vanilla bo, vbo), the bo algorithm with virtual derivative sign observations (referred to as derivative bo, dbo), and the adaptive version of dbo (referred to as adbo). For the last two of these, virtual derivative sign observations are added if the next proposed point is within a fixed fraction of the edge length of the search space from any border. For adbo, old virtual derivative observations are removed before a regular observation is added if the Euclidean distance between the points is less than a fixed fraction of the edge length of the search space.
4.2 Case Study 1: A Simple Example Function
The algorithm is used to illustrate the unwanted boundary over-exploration effect of regular bo. To show this, a simple function consisting of two Gaussian components is optimized with vbo and dbo using lcb as the acquisition function. The function and the first 15 acquisitions are visualized in Figure 1 in the introduction.
The results show that vbo sampled seven points from near the borders, whereas the proposed method sampled only four. Furthermore, vbo has only one acquisition near the true minimum, whereas dbo has eight.
4.3 Case Study 2: Random Multivariate Normal Distribution Functions
The algorithms are used to find the minima of multivariate normal distribution (mnd) functions of different dimensions,

$f(x) = -\mathcal{N}(x \mid \mu, \Sigma),$

where the mean $\mu$ is selected uniformly at random inside the search space and the covariance $\Sigma$ is a random positive semidefinite matrix whose eigenvalues are selected uniformly at random from a fixed interval. To simulate real-life observations, random noise is added to the observations, $y = f(x) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, for some fixed $\sigma_n$. Two example functions, without noise, are visualized in Figure 3.
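Such test functions can be drawn as in the sketch below. The mean and eigenvalue ranges are illustrative placeholders, since the exact ranges used in the experiments are not reproduced here.

```python
import numpy as np

def random_mnd_function(dim, rng, mean_range=(0.2, 0.8),
                        eig_range=(0.05, 0.5)):
    """Draw one random test function: a negated Gaussian bump on the unit
    box, with a uniformly drawn mean and a random PSD covariance whose
    eigenvalues are drawn uniformly (ranges here are illustrative)."""
    mu = rng.uniform(*mean_range, size=dim)
    eigvals = rng.uniform(*eig_range, size=dim)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random rotation
    prec = np.linalg.inv(Q @ np.diag(eigvals) @ Q.T)      # precision matrix
    def f(x, noise_std=0.0):
        d = x - mu
        return -np.exp(-0.5 * d @ prec @ d) + noise_std * rng.standard_normal()
    return f, mu

rng = np.random.default_rng(0)
f, mu = random_mnd_function(2, rng)
```

By construction the noiseless function attains its minimum of $-1$ (up to normalization) exactly at the drawn mean, which is what makes the true minimum of each random test function known.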
We drew 100 mnd functions in each dimension $d = 1, \ldots, 4$, for three different noise levels. For all these 400 functions, we ran 35 iterations of the three bo algorithms with each of the three acquisition functions, ei, lcb and mpi. The 25th, 50th, and 75th percentiles of the found minimum values as a function of the iteration are illustrated in Figure 4 for all acquisition functions at one of the noise levels. The figure displays each optimization run only until it converges to the minimum. Results for the other noise levels can be found in the supplementary material.
The results show that in all dimensions, for all acquisition functions, the performances of dbo and adbo are better than or equal to the performance of vbo. Particularly with the acquisition functions ei and lcb, dbo converges to the global minimum significantly faster than vbo, especially if the function is corrupted by noise. The performance difference between adbo and vbo is not as large, but still present. The results also show that the variance of the optimization performance between different optimization runs is notably smaller for dbo and adbo than for vbo.
The results can be better understood from the known properties of the presented acquisition functions. lcb is known to be the most exploratory and thus samples the most acquisitions from the borders of the search space. As the presented algorithms prevent this, more acquisitions are taken further from the borders, steering the optimization towards the real minimum. The same reason also explains why the dbo and adbo results are better if the function is corrupted by noise: with noisy functions the uncertainty increases near the borders, resulting in even more acquisitions there. Another interesting phenomenon in the results is illustrated by the four-dimensional functions for ei and mpi. For these acquisition functions, the latent gp function fails to model the underlying function and the search does not converge to the minimum. This problem is common with bo that is initialised with a small number of samples.
4.4 Case Study 3: Sigopt Function Library
Dewancker et al. (2016) introduced a benchmark function library^2 called Sigopt for evaluating bo algorithms. This library consists of functions of varying dimensionality, each belonging to one or more groups such as unimodal, discrete, or oscillatory functions. In addition to the functions, the library also defines the search space and the true minimum of each function. When restricting the functions to at most 4 dimensions and taking into account only unimodal, non-discrete functions with no global border minima or large plateaus, the library yields 48 functions. For these functions, the number of functions per dimension varies between 10 and 14. As in the previous case study, to mimic real use cases, the function observations are corrupted with additive random noise $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$. Two example functions, without noise, are visualized in Figure 5.

^2 Function library available at: https://github.com/sigopt/evalset
Similarly to the previous example, for all these functions from the Sigopt dataset, we ran 35 iterations of the proposed and baseline methods with all acquisition functions. The 25th, 50th, and 75th percentiles of the found minimum values as a function of the iteration are illustrated in Figure 6 for all acquisition functions at one of the noise levels. The figure displays each optimization run only until it converges to the minimum. Results for the other noise levels can be found in the supplementary material.
The results are similar to those for the mnd functions. dbo and adbo still perform better than vbo, especially with lcb and ei. There is not as much difference between adbo and vbo as before. As before, the variance of the optimization performance between different optimization runs is notably smaller for dbo and adbo than for vbo. Since there are fewer functions per dimension, the overall variability in the results is bigger and the percentile curves are not as smooth.
4.5 Case Study 4: Simple Gaussian Functions With Minima on Border
The algorithms are used to find the minima of Gaussian functions similar to those in Case Study 2 (Section 4.3), with the difference that the global minimum of each function lies exactly on the border of the search space. Otherwise, the shape of the functions is similar to those illustrated in Figure 3. The purpose of this case study is to show what happens to the performance of the proposed method if the a priori assumption is violated.
We drew 100 mnd functions in each dimension $d = 1, \ldots, 4$, for three different noise levels. For all these 400 functions, we ran 35 iterations of the three bo algorithms with each of the three acquisition functions, ei, lcb and mpi. The 25th, 50th, and 75th percentiles of the found minimum values as a function of the iteration are illustrated in Figure 7 for all acquisition functions at one of the noise levels. The figure displays each optimization run only until it converges to the minimum. Results for the other noise levels can be found in the supplementary material.
As expected, the results show that dbo does not perform as well as vbo and adbo. The performances of adbo and vbo are very similar for all noise levels and dimensions for lcb and mpi; for ei, adbo performs similarly to dbo. It is surprising that for lcb, dbo performs as well as vbo.
5 Conclusions
We have presented a Bayesian optimization algorithm that utilizes qualitative prior information concerning the objective function on the borders. Namely, we assume that the gradient of the underlying function points towards the centre on all borders. Typical uses of Bayesian optimization concern expensive functions, and in many applications qualitative knowledge of the generic properties of the function is available prior to optimization. Our approach generalizes naturally to these settings, so that the user can inform the gp prior about the direction of the gradient at any location of the search space.
The proposed bo method has proved to significantly improve the optimization speed and the found minimum when comparing its average performance to that of the standard bo algorithm without virtual derivative sign observations. The difference in performance is more significant if the assumption of no global or local minima on the border of the search space holds, but it is still notable if the assumption is relaxed so that only the global minimum is required not to lie on the border.
Acknowledgments
We thank Michael Andersen and Juho Piironen for their valuable comments to improve the manuscript. We also acknowledge the computational resources provided by Aalto University’s Department of Computer Science.
References
 Brochu et al. (2010) Brochu, E., Cora, V. M., and de Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
 Croarkin et al. (2013) Croarkin, C., Tobias, P., and Zey, C. (2013). Engineering statistics handbook. NIST iTL.
 Dewancker et al. (2016) Dewancker, I., McCourt, M., Clark, S., Hayes, P., Johnson, A., and Ke, G. (2016). A stratified analysis of Bayesian optimization methods. arXiv preprint arXiv:1603.09441.
 González et al. (2016) González, J., Osborne, M. A., and Lawrence, N. D. (2016). GLASSES: relieving the myopia of Bayesian optimisation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 911, 2016, pages 790–799.
 Gosling et al. (2007) Gosling, J. P., Oakley, J. E., and O’Hagan, A. (2007). Nonparametric elicitation for heavytailed prior distributions. Bayesian Analysis, 2(4):693–718.
 Jauch and Peña (2016) Jauch, M. and Peña, V. (2016). Bayesian optimization with shape constraints. arXiv preprint arXiv:1612.08915.
 Jones et al. (1998) Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492.
 Koistinen et al. (2016) Koistinen, O., Maras, E., Vehtari, A., and Jónsson, H. (2016). Minimum energy path calculations with Gaussian process regression. Nanosystems: Physics, Chemistry, Mathematics, 7(6):925–935.
 Koistinen et al. (2017) Koistinen, O.-P., Dagbjartsdóttir, F. B., Ásgeirsson, V., Vehtari, A., and Jónsson, H. (2017). Nudged elastic band calculations accelerated with Gaussian process regression. Journal of Chemical Physics, 147:152720.
 Krause and Guestrin (2007) Krause, A. and Guestrin, C. (2007). Nonmyopic active learning of Gaussian processes: an explorationexploitation approach. In Proceedings of the 24th International Conference on Machine Learning, volume 227 of ACM International Conference Proceeding Series, pages 449–456. ACM.
 Kushner (1964) Kushner, H. J. (1964). A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106.
 O’Hagan (1992) O’Hagan, A. (1992). Some Bayesian numerical analysis. In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian statistics 4, pages 345–363. Oxford University Press.
 Osborne et al. (2009) Osborne, M. A., Garnett, R., and Roberts, S. J. (2009). Gaussian processes for global optimization. In 3rd international conference on learning and intelligent optimization (LION3), pages 1–15.
 Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian processes for machine learning. The MIT Press.
 Riihimäki and Vehtari (2010) Riihimäki, J. and Vehtari, A. (2010). Gaussian processes with monotonicity information. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 645–652.
 Riihimäki and Vehtari (2014) Riihimäki, J. and Vehtari, A. (2014). Laplace approximation for logistic Gaussian process density estimation and regression. Bayesian Analysis, 9(2):425–448.
 Shahriari et al. (2016a) Shahriari, B., BouchardCôté, A., and de Freitas, N. (2016a). Unbounded Bayesian optimization via regularization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1168–1176.
 Shahriari et al. (2016b) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. (2016b). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175.
 Siivola et al. (2016) Siivola, E., Piironen, J., and Vehtari, A. (2016). Automatic monotonicity detection for Gaussian processes. arXiv preprint arXiv:1610.05440.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959.
 Solak et al. (2003) Solak, E., Murray, S. R., Leithead, W. E., Leith, D. J., and Rasmussen, C. E. (2003). Derivative observations in Gaussian process models of dynamic systems. In Advances in Neural Information Processing Systems, pages 1033–1040.
 Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. Proceedings of the 27th International Conference on Machine Learning.
 Wang and Berger (2016) Wang, X. and Berger, J. O. (2016). Estimating shape constrained functions using Gaussian processes. SIAM/ASA Journal on Uncertainty Quantification, 4(1):1–25.
 Wu et al. (2017) Wu, J., Poloczek, M., Wilson, A. G., and Frazier, P. I. (2017). Bayesian optimization with gradients. arXiv preprint arXiv:1703.04389.