Abstract
We consider black-box global optimization of time-consuming-to-evaluate functions on behalf of a decision-maker whose preferences must be learned. Each feasible design is associated with a time-consuming-to-evaluate vector of attributes, each vector of attributes is assigned a utility by the decision-maker's utility function, and this utility function may be learned approximately using preferences expressed by the decision-maker over pairs of attribute vectors. Past work has used this estimated utility function as if it were error-free within single-objective optimization. However, errors in utility estimation may yield a poor suggested decision. Furthermore, this approach produces a single suggested “best” design, whereas decision-makers often prefer to choose from a menu of designs. We propose a novel Bayesian optimization algorithm that acknowledges the uncertainty in preference estimation and implicitly chooses designs to evaluate with the time-consuming function that are good not just for a single estimated utility function but for a range of likely utility functions. Our algorithm then shows a menu of designs and evaluated attributes to the decision-maker, who makes a final selection. We demonstrate the value of our algorithm in a variety of numerical experiments.
Bayesian optimization with uncertain preferences over attributes
Raul Astudillo (Cornell University) and Peter I. Frazier (Cornell University, Uber)
1 Introduction
We begin with a motivating example: helping a cancer patient (the “decision-maker”) find the best treatment. Cancer treatments exhibit a range of abilities to cure disease, side effects, and financial costs (aning2012patient; wong2013cancer; marshall2016women), referred to here as “attributes”. Suppose a patient considers $k$ real-valued attributes when selecting a cancer treatment. Also suppose a time-consuming-to-evaluate black-box computational simulator can use the patient's medical history to compute the attributes, $f(x)$, of treatment $x$. The patient has an implicit preference over these attributes, and our goal is to help her find her most preferred treatment by querying our simulator.
One existing approach, pursued within preference-based reinforcement learning (wirth2017survey), is to first learn the patient's preferences (chu2005preference; dewancker2016; abbas2018foundations) and then optimize using the learned estimates. We call this approach “learn then optimize”. This approach asks the patient for her preferences between attribute vectors $f(x)$ and $f(x')$, corresponding to pairs of treatments $x$, $x'$. It then learns a utility function $U$, e.g., using preference learning with Gaussian processes (chu2005preference), such that the judgements are as consistent as possible with the estimated utility differences $U(f(x)) - U(f(x'))$. It then solves $\max_x U(f(x))$ using a method for optimizing time-consuming-to-evaluate black-box functions, such as Bayesian optimization (BayesOpt) (frazier2018tutorial), assuming that the estimated utility function is correct. Optionally, if more judgements become available during optimization, these can be used to update the estimate (wirth2017survey). This approach, however, is not robust to uncertainty in preference estimates.
To illustrate how becoming robust to uncertainty in preferences can improve performance, suppose that preference learning suggests that the patient's true utility function is close to one of $m$ possible functions $U_1, \dots, U_m$. Then, a better approach would be to offer the patient a set of treatments $x_1^*, \dots, x_m^*$, where $x_j^* \in \arg\max_x U_j(f(x))$, and let her choose among them. This will provide near-optimal utility to the patient, while optimizing for a single point estimate of the utility function will not. While this approach improves over the standard approach in the utility it provides, it requires solving $m$ optimization problems with an expensive-to-evaluate objective, which becomes computationally infeasible as $m$ grows. Our approach (described below) delivers similar utility gains using fewer queries to the objective function.
Another approach, which can be used when each attribute is a quantity that the patient wants to be as large (or small) as possible, is to use multi-objective Bayesian optimization (abdolshah2019multi; knowles2006) to estimate the Pareto frontier. This approach, however, does not use interaction with the patient to focus optimization on the parts of the Pareto frontier most likely to contain the patient's preferred solution. Intuitively, such information could accelerate optimization, especially when moderate or large numbers of attributes create high-dimensional Pareto frontiers and lead to a large number of Pareto-optimal solutions.
Motivated by these shortcomings of existing approaches, we propose a novel Bayesian optimization approach that leverages learned preferences to solve the problem described above. By modeling uncertainty in the utility function, it improves the utility of the delivered solution over the “learn then optimize” approach. By leveraging judgements over attributes from the decision-maker, it uses fewer objective function queries than multi-objective approaches.
Our approach uses preference learning and pairwise judgements from the decision-maker to infer a Bayesian posterior distribution over the decision-maker's utility function. Within a Bayesian optimization framework, it models the objective $f$ using a multi-output Gaussian process and uses one of two novel acquisition functions, the expected improvement under utility uncertainty (uEI) or Thompson sampling under utility uncertainty (uTS), to iteratively choose designs at which to evaluate $f$. Optionally, during optimization, decision-maker judgements on the evaluated designs may be incorporated into our posterior distribution on the utility. At the conclusion of optimization, a menu of designs is shown to the decision-maker, who makes a final selection.
Our primary contribution is this pair of novel acquisition functions, uEI and uTS, which generalize existing Bayesian optimization acquisition functions to the utility-uncertainty setting. We also provide an efficient simulation-based estimator of the gradient of uEI, which can be made more efficient still in the important special case of linear utility functions, and use these estimates within multi-start stochastic gradient ascent to efficiently maximize uEI.
Our approach fills an important gap between today's single-objective optimization approaches, which assume perfect knowledge of the decision-maker's preferences, and multi-objective optimization approaches, which do not provide a principled way to accommodate partial information about preferences.
2 Problem Setting
We now formally describe our problem setting.
2.1 Designs and Attributes
We assume that both designs and attributes can be represented as vectors. More concretely, we assume that the space of designs can be represented as a compact set $\mathbb{X} \subset \mathbb{R}^d$, and attributes are given by a derivative-free, expensive-to-evaluate, black-box continuous function, $f : \mathbb{X} \to \mathbb{R}^k$. As is common in BayesOpt, we assume that $d$ is not too large and that projections onto $\mathbb{X}$ can be efficiently computed.
2.2 DecisionMaker’s Preferences
We assume that there is a decision-maker whose preference over designs is characterized by the designs' attributes through a von Neumann–Morgenstern utility function (vonNeuman), $U : \mathbb{R}^k \to \mathbb{R}$. This implies that the decision-maker (strictly) prefers a design $x$ over $x'$ if and only if $U(f(x)) > U(f(x'))$. Thus, of all the designs, the decision-maker most prefers one in the set $\arg\max_{x \in \mathbb{X}} U(f(x))$. As is standard in preference learning (furnkranz2010preference), we assume that the decision-maker can provide ordinal preferences between two designs $x$ and $x'$ when shown previously evaluated attribute vectors $f(x)$ and $f(x')$.
2.3 Interaction with the DecisionMaker and Computational Model
In our approach, an algorithm interacts sequentially with a human decision-maker and a time-consuming-to-evaluate objective function (typically a computer model). The algorithm interacts with the computational model simply by selecting a design $x \in \mathbb{X}$ and evaluating $f(x)$. We let $x_n$ indicate the point at which we evaluate $f$ for the $n$th time. As is standard in Bayesian optimization, an initial set of computational model evaluations is chosen uniformly at random from the feasible domain, after which evaluations are guided by an acquisition function described below in §3.
The algorithm interacts with the decision-maker by receiving ordinal preferences between pairs of attribute vectors. We index interactions with the decision-maker by $\ell$, letting $y_\ell$ and $y'_\ell$ refer to the attribute vectors queried in this interaction, and $s_\ell \in \{1, -1\}$ indicating the decision-maker's response, where $s_\ell = 1$ indicates a preference for $y_\ell$ and $s_\ell = -1$ a preference for $y'_\ell$. We let $L(n)$ be the number of design pairs evaluated by the decision-maker by the completion of the $n$th run of the computational model. We envision that $y_\ell$ and $y'_\ell$ would typically be the attribute vectors of previously evaluated designs, $f(x_i)$ and $f(x_j)$, with $i, j \leq n$.
For concreteness, our numerical experiments assume that, before each evaluation of $f$, the decision-maker provides feedback on one pair of designs chosen uniformly at random from among those previously evaluated. Our framework easily supports other patterns of interaction. For example, it supports a setting where the decision-maker provides feedback in a single batch after the first-stage evaluations of the computational model are complete, either over random previously evaluated attribute vectors or using a more sophisticated and query-efficient selection of attribute vectors (lepird2015bayesian). It also supports a setting in which the decision-maker provides feedback at a random series of time points on pairs of previously evaluated attribute vectors of their choosing.
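As a concrete illustration, the random-pair feedback pattern used in our experiments can be sketched as follows. The function and argument names here are ours, not part of the framework, and the `decision_maker` callable stands in for the human response:

```python
import random

def query_random_pair(evaluated_attributes, decision_maker, rng):
    """One interaction: show a random pair of previously evaluated
    attribute vectors and record the ordinal preference.

    decision_maker(y, y_prime) returns +1 if y is preferred, -1 otherwise.
    """
    y, y_prime = rng.sample(evaluated_attributes, 2)
    s = decision_maker(y, y_prime)
    return (y, y_prime, s)
```

Each returned triple $(y, y', s)$ would be appended to the comparison history used by the preference posterior of §2.5.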
2.4 Statistical Model Over $f$
As is standard in BayesOpt (review), we place a (multi-output) Gaussian process (GP) prior on $f$ (alvarez2012kernels), characterized by a mean function, $\mu_0$, and a positive-definite covariance function, $K_0$. (Here, the covariance function takes values in the cone of positive-definite matrices.) Thus, after observing $n$ noise-free evaluations of $f$ at points $x_1, \dots, x_n$, the estimates of the designs' attributes are given by the posterior distribution on $f$, which is again a multi-output GP, with mean function $\mu_n$ and covariance function $K_n$, where $\mu_n$ and $K_n$ can be computed in closed form in terms of $\mu_0$ and $K_0$ (liu2018remarks).
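For intuition, a minimal single-output version of this noise-free GP posterior update can be sketched as follows (one independent GP per attribute; the zero prior mean, squared-exponential kernel, and unit length scale are illustrative choices, not those used in our experiments):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel matrix between row-stacked inputs A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, Y, Xstar, ls=1.0, jitter=1e-8):
    """Noise-free GP posterior mean and covariance for one output."""
    K = rbf(X, X, ls) + jitter * np.eye(len(X))
    Ks = rbf(X, Xstar, ls)
    Kss = rbf(Xstar, Xstar, ls)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    v = np.linalg.solve(L, Ks)
    mean = Ks.T @ alpha          # posterior mean at Xstar
    cov = Kss - v.T @ v          # posterior covariance at Xstar
    return mean, cov
```

Because the observations are noise-free, the posterior mean interpolates the data and the posterior variance collapses at evaluated points.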
2.5 Statistical Model Over $U$
We use Bayesian preference learning (chu2005preference; lepird2015bayesian) to infer a posterior probability distribution over the utility function given preferences expressed by the decisionmaker. Although this method is standard in the literature, we describe it here for completeness.
We use a parametric family of utility functions $\{U(\cdot; \theta) : \theta \in \Theta\}$ (following, for example, akrour2014programming; wirth2016model); a prior probability distribution over $\theta$, $p_0$; and a likelihood function, $L(s \mid \Delta)$, giving the conditional probability of the decision-maker expressing preference $s$ in response to an offered pair of attribute vectors with utility difference $\Delta$. The posterior distribution over $\theta$ after feedback on $\ell$ pairwise comparisons, written $p_\ell$, is then given by Bayes' rule:
$p_\ell(\theta) \propto p_0(\theta) \prod_{i=1}^{\ell} L\bigl(s_i \mid U(y_i; \theta) - U(y'_i; \theta)\bigr).$
In our approach, we rely only on the ability to sample from this posterior distribution.
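One simple way to obtain such posterior samples, when the prior is represented by a finite set of draws, is importance resampling under a probit likelihood. The sketch below assumes linear utilities $U(y; \theta) = \theta^\top y$; the function names are ours and the clipping constant is an implementation convenience:

```python
import math
import numpy as np

def probit(z):
    # Standard normal CDF, used as the preference likelihood.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior_theta_samples(prior_samples, comparisons, rng, n_draws=100):
    """Importance-weight prior draws of theta by the probit likelihood
    of the observed comparisons, then resample with replacement.

    prior_samples: (M, k) draws of theta from the prior.
    comparisons: list of (y, y_prime, s) with s=+1 if y preferred, -1 otherwise.
    """
    logw = np.zeros(len(prior_samples))
    for y, yp, s in comparisons:
        diff = prior_samples @ (np.asarray(y) - np.asarray(yp))
        lik = np.array([probit(s * d) for d in diff])
        logw += np.log(np.clip(lik, 1e-300, None))  # guard against underflow
    w = np.exp(logw - logw.max())
    w /= w.sum()
    idx = rng.choice(len(prior_samples), size=n_draws, p=w)
    return prior_samples[idx]
```

Any other sampler (e.g., MCMC) could be substituted; the algorithm only requires draws from $p_\ell$.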
The most widely used parametric family of utility functions is that of linear functions (wirth2017survey), with other examples including linear functions over kernel-based feature spaces (wirth2016model; kupcsik2018learning) and deep neural networks (christiano2017deep). Commonly used likelihood functions include the probit and logit (wirth2017survey). In our numerical experiments, for simplicity, we assume fully accurate preference responses, with parametric families and priors described below. Although we assume parametric utility functions, conceptually, our approach generalizes to handle nonparametric Bayesian preference learning with noisy judgements (chu2005preference). However, this poses additional computational challenges, as our approach internally performs optimization of samples of the utility function, which can be slow for nonparametric models.
2.6 Measure of Performance
We suppose that, after $N$ evaluations of the computational model (and judgements on attribute vector pairs), the decision-maker selects her most preferred design among all evaluated designs. Thus, the utility generated, given $\theta$, is
$\max_{n=1,\dots,N} U(f(x_n); \theta), \qquad (1)$
and we wish to adaptively choose designs to evaluate, $x_1, \dots, x_N$, so as to maximize the expected value of (1), where the expectation is taken over the prior on $\theta$ and the randomness in $x_1, \dots, x_N$ (induced by the random first stage of samples and randomness in the decision-maker's responses).
3 Acquisition Functions
Here we propose two novel acquisition functions, the expected improvement under utility uncertainty (uEI) and Thompson sampling under utility uncertainty (uTS), for selecting points at which to query $f$. The bulk of our development and analysis focuses on uEI, since it is the more difficult of the two to optimize and it performs the better of the two in numerical experiments.
3.1 Expected Improvement Under Utility Uncertainty (uEI)
Expected improvement is arguably the most popular acquisition function in BayesOpt. It has been successfully generalized for constrained (pmlrv32gardner14) and multi-objective optimization (emmerich2006single), and we next show that it can be naturally generalized to our setting as well by extending expected improvement's one-step optimality analysis (jones1998efficient; frazier2018tutorial) to the setting with utility uncertainty.
After evaluating $n$ designs, $x_1, \dots, x_n$, the utility obtained by the decision-maker when she selects her most preferred design among this set is $U_n^*(\theta) = \max_{i=1,\dots,n} U(f(x_i); \theta)$.
On the other hand, if we evaluate one more design, $x$, the utility obtained by the decision-maker increases by $\{U(f(x); \theta) - U_n^*(\theta)\}^+$, where $\{a\}^+ = \max\{a, 0\}$.
This difference measures the improvement from sampling $x$. Thus, a natural sampling policy is to evaluate the design that maximizes the expected improvement,
$\mathrm{uEI}_n(x) = \mathbb{E}_n\bigl[\{U(f(x); \theta) - U_n^*(\theta)\}^+\bigr], \qquad (2)$
where the expectation is over both $f(x)$ and $\theta$, and the subscript $n$ indicates that the expectation is computed with respect to their corresponding posterior distributions given the previous computational evaluations $f(x_1), \dots, f(x_n)$ and decision-maker responses.
We call uEI the expected improvement under utility uncertainty and refer to the above policy as the uEI policy. By construction, this sampling policy is one-step Bayes optimal.
3.1.1 Computation and Maximization of uEI
In contrast with the standard expected improvement, uEI cannot be computed in closed form. However, as we show next, it can still be efficiently maximized.
First, we introduce some notation. Making a slight abuse of notation, we write $\mu_n = \mu_n(x)$ and $K_n = K_n(x, x)$. We also let $C_n$ be the lower Cholesky factor of $K_n$.
We note that, for any fixed $x$, the time-$n$ posterior distribution of $f(x)$ is normal with mean $\mu_n$ and covariance matrix $K_n$. Therefore, we can express $f(x) = \mu_n + C_n Z$, where $Z$ is a $k$-variate standard normal random vector, and thus
$\mathrm{uEI}_n(x) = \mathbb{E}_n\bigl[\{U(\mu_n + C_n Z; \theta) - U_n^*(\theta)\}^+\bigr].$
This implies that we can compute $\mathrm{uEI}_n(x)$ using Monte Carlo, as summarized in Algorithm 1.
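The Monte Carlo estimator behind Algorithm 1 can be sketched as a direct implementation of the reparameterization above (function and argument names are ours):

```python
import numpy as np

def uei_mc(mean, chol, theta_samples, Y_obs, utility, rng, n_z=64):
    """Monte Carlo estimate of uEI at a single design.

    mean, chol: posterior mean (k,) and lower Cholesky factor (k, k) of f(x).
    theta_samples: draws from the posterior over the utility parameters.
    Y_obs: (n, k) attributes of previously evaluated designs.
    utility: callable utility(y, theta) -> float.
    """
    total = 0.0
    for theta in theta_samples:
        best = max(utility(y, theta) for y in Y_obs)  # incumbent under this theta
        for _ in range(n_z):
            z = rng.standard_normal(len(mean))
            y = mean + chol @ z  # reparameterized posterior sample of f(x)
            total += max(utility(y, theta) - best, 0.0)
    return total / (len(theta_samples) * n_z)
```

Note that the incumbent $U_n^*(\theta)$ is recomputed for each utility sample, since the decision-maker's best evaluated design depends on $\theta$.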
In principle, the above is enough to maximize uEI using a derivative-free global optimization algorithm (for non-expensive functions). However, we can optimize uEI more efficiently by leveraging the derivative information constructed in the following proposition.
Proposition 1.
Under mild regularity conditions, $\mathrm{uEI}_n$ is differentiable almost everywhere and its gradient, when it exists, is given by
$\nabla \mathrm{uEI}_n(x) = \mathbb{E}_n[\gamma_n(x, \theta, Z)],$
where the expectation is over $\theta$ and $Z$, and
$\gamma_n(x, \theta, Z) = \nabla \{U(\mu_n(x) + C_n(x) Z; \theta) - U_n^*(\theta)\}^+,$
where the gradient is with respect to $x$.
Thus, $\gamma_n$ provides an unbiased estimator of $\nabla \mathrm{uEI}_n(x)$, which can be used within a gradient-based stochastic optimization algorithm, such as stochastic gradient ascent, to find stationary points of uEI. We may then start stochastic gradient ascent from multiple starting points, use simulation to evaluate uEI at each resulting solution, and select the best. By increasing the number of starting points, we may find a high-quality local optimum and, asymptotically, a global optimum.
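A generic multi-start stochastic gradient ascent loop of the kind described here can be sketched as follows; the diminishing step size, unit box domain, and projection via clipping are illustrative assumptions, not the exact settings of our implementation:

```python
import numpy as np

def multistart_sga(grad_est, value_est, starts, lr=0.5, iters=50):
    """Stochastic gradient ascent from several starting points.

    grad_est(x): (possibly noisy) unbiased gradient estimate at x.
    value_est(x): (possibly simulation-based) objective estimate at x.
    Returns the best point found and its estimated value.
    """
    best_x, best_v = None, -np.inf
    for x0 in starts:
        x = np.array(x0, dtype=float)
        for t in range(1, iters + 1):
            x = x + (lr / t) * np.asarray(grad_est(x))  # diminishing steps
            x = np.clip(x, 0.0, 1.0)  # project back onto the box domain
        v = value_est(x)
        if v > best_v:
            best_x, best_v = x, v
    return best_x, best_v
```

In our setting, `grad_est` would be the estimator $\gamma_n$ and `value_est` the Monte Carlo estimate of uEI.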
A formal statement and proof of Proposition 1, along with the proofs of all other theoretical results, can be found in the supplementary material.
3.1.2 Computation of uEI and Its Gradient When $U$ Is Linear
While the above approach can be used for efficiently maximizing uEI for general utility functions, we can make maximization even more efficient for linear utility functions, the most widely used class in practice.
Proposition 2.
Suppose that $U(y; \theta) = \theta^\top y$ for all $y$ and $\theta$. Then,
$\mathrm{uEI}_n(x) = \mathbb{E}\bigl[\Delta_n(x, \theta)\,\Phi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr) + \sigma_n(x, \theta)\,\varphi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr)\bigr],$
where the expectation is over $\theta$, $\Delta_n(x, \theta) = \theta^\top \mu_n(x) - U_n^*(\theta)$, $\sigma_n(x, \theta)^2 = \theta^\top K_n(x, x)\,\theta$, and $\varphi$ and $\Phi$ are the standard normal density function and cumulative distribution function, respectively.
The result above shows that, when each $U(\cdot; \theta)$ is linear, the computation of uEI essentially reduces to that of the standard expected improvement, modulo integrating the uncertainty over $\theta$. In particular, in this case the uncertainty with respect to $Z$ can be integrated out. Moreover, one can also derive a result analogous to Proposition 1 in which the explicit dependence on $Z$ is eliminated as well.
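Under these assumptions, Proposition 2 can be implemented by averaging the classical expected improvement formula over posterior draws of $\theta$; a sketch (names ours):

```python
import math
import numpy as np

def uei_linear(mu, Sigma, theta_samples, Y_obs):
    """uEI for linear utilities U(y; theta) = theta @ y.

    mu, Sigma: posterior mean (k,) and covariance (k, k) of f(x).
    theta_samples: draws from the posterior over theta.
    Y_obs: (n, k) attributes of previously evaluated designs.
    """
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    total = 0.0
    for theta in theta_samples:
        best = max(float(theta @ y) for y in Y_obs)  # incumbent under this theta
        delta = float(theta @ mu) - best
        sigma = math.sqrt(max(float(theta @ Sigma @ theta), 0.0))
        if sigma < 1e-12:  # degenerate case: no posterior variance along theta
            total += max(delta, 0.0)
        else:
            z = delta / sigma
            total += delta * Phi(z) + sigma * phi(z)
    return total / len(theta_samples)
```

Each term in the average is exactly the standard EI formula with incumbent $U_n^*(\theta)$, so only the outer average over $\theta$ requires Monte Carlo.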
Proposition 3.
Suppose that $U(y; \theta) = \theta^\top y$ for all $y$ and $\theta$. Then, under mild regularity conditions, $\mathrm{uEI}_n$ is differentiable, and its gradient, when it exists, is given by
$\nabla \mathrm{uEI}_n(x) = \mathbb{E}\bigl[\nabla \bigl\{\Delta_n(x, \theta)\,\Phi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr) + \sigma_n(x, \theta)\,\varphi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr)\bigr\}\bigr],$
where the expectation is over $\theta$ and the gradient is with respect to $x$.
Analogously to Proposition 1, Proposition 3 provides a method for efficiently computing an unbiased estimator of $\nabla \mathrm{uEI}_n(x)$. Moreover, it also implies that, if $\Theta$ is discrete and its cardinality is not too large, the gradient of uEI can be computed exactly, allowing the use of faster non-stochastic optimization algorithms for maximizing uEI.
3.2 Exploitation vs. Exploration Trade-Off
One of the key properties of the classical expected improvement acquisition function is that it is increasing with respect to both the posterior mean and the posterior variance. This means that it prefers to sample points that are either promising with respect to our current knowledge or still highly uncertain, a desirable property of any sampling policy aiming to balance exploitation and exploration. The following result shows that, under standard assumptions on $U$, the uEI sampling policy satisfies an analogous property.
Proposition 4.
Suppose that, for every $\theta$, $U(\cdot; \theta)$ is convex and increasing in each coordinate. Also suppose $x, x' \in \mathbb{X}$ are such that $\mu_n(x) \geq \mu_n(x')$ and $K_n(x, x) \succeq K_n(x', x')$, where the first inequality is coordinate-wise and $\succeq$ denotes the partial order defined by the cone of positive semidefinite matrices. Then, $\mathrm{uEI}_n(x) \geq \mathrm{uEI}_n(x')$.
3.3 Thompson Sampling under Utility Uncertainty (uTS)
Thompson sampling under utility uncertainty (uTS) generalizes the well-known Thompson sampling method (thompson1933likelihood) to the utility-uncertainty setting.
It first samples $\theta$ from its posterior distribution. Then, it samples $f$ from its Gaussian process posterior distribution. To decide where to evaluate next, it maximizes $U(f(\cdot); \theta)$ using these sampled values and evaluates $f$ at the resulting maximizer.
This contrasts with the “learn then optimize” approach in that it samples $\theta$ from its posterior rather than simply setting it equal to a point estimate. For example, if we implemented learn then optimize using standard Thompson sampling, we would sample only $f$ from its posterior and then maximize $U(f(\cdot); \hat{\theta})$, where $\hat{\theta}$ is a point estimate, such as the maximum a posteriori estimate. uTS can induce substantially more exploration than this more classical approach.
uTS can be implemented by sampling over a grid of points if $\mathbb{X}$ is low-dimensional. It can also be implemented for higher-dimensional $\mathbb{X}$ by optimizing with a method for continuous nonlinear optimization (such as CMA, hansen2016cma), lazily sampling $f$ from the posterior at each new point that CMA wants to evaluate, conditioning on previous real and sampled evaluations. We use the latter approach in our computational experiments.
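On a discretized domain, one uTS step can be sketched as follows (independent outputs and precomputed per-output posterior covariances over the grid are assumed; names ours):

```python
import numpy as np

def uts_next_point(grid, post_mean, post_cov, sample_theta, utility, rng):
    """One Thompson-sampling step under utility uncertainty on a grid.

    grid: (G, d) candidate designs.
    post_mean: (G, k) posterior means; post_cov: (k, G, G) per-output
    posterior covariances (outputs assumed independent).
    """
    theta = sample_theta(rng)  # one draw of the utility parameters
    G, k = post_mean.shape
    f_draw = np.empty((G, k))
    for j in range(k):  # one joint GP sample per output
        L = np.linalg.cholesky(post_cov[j] + 1e-9 * np.eye(G))
        f_draw[:, j] = post_mean[:, j] + L @ rng.standard_normal(G)
    scores = np.array([utility(f_draw[i], theta) for i in range(G)])
    return grid[np.argmax(scores)]
```

Drawing a fresh $(\theta, f)$ pair at every iteration is what produces the exploration behavior described above.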
4 Additional Related Work
The introduction discusses the two most closely related lines of work: the “learn then optimize” approach pursued within preference-based reinforcement learning (PbRL), and multi-objective Bayesian optimization.
The most closely related work in PbRL is utility-based PbRL using trajectory utilities (wirth2017survey). This variant of PbRL seeks to design a control policy to maximize the utility of a human subject using features computed from trajectories. Work in this area includes akrour2014programming and wirth2016model. Unlike our work, the uncertainty in utility function estimates is not considered when performing optimization.
Multi-objective BayesOpt includes knowles2006; bautista2009; binois2015quantifying; hl16; shah2016pareto; feliot2017bayesian. Multi-objective optimization cannot easily incorporate prior information about the decision-maker's preferences, though several attempts have been made, mostly through modified Pareto-dominance criteria or weighted-sum approaches (cvetkovic2002preferences; zitzler2004indicator; rachmawati2006preference). Most of this work is outside the BayesOpt framework, with only three exceptions known to us.
feliot2018user proposes a weighted version of the expected Pareto hypervolume improvement approach (emmerich2006single) to focus the search on certain regions of the Pareto front. However, no method is provided for choosing weights from data, in contrast with our approach's ability to learn from decision-maker interactions. Moreover, this method suffers from the same computational limitations as the standard expected Pareto hypervolume improvement approach, limiting its applicability to at most three objectives (hl16). abdolshah2019multi also proposes a weighted version of the expected Pareto hypervolume improvement approach, to explore the region of the Pareto frontier satisfying a preference-order constraint. Finally, paria2018flexible proposes an approach based on random scalarizations. In contrast with our approach, no method is available for choosing the distribution of these scalarizations from data.
Another related literature is preferential Bayesian optimization (gonzalez2017preferential). Within this line of work, kupcsik2018learning studies optimization of a parameterized control policy for robotic object handover, and brochu2010tutorial applies it to realistic material design in computer graphics. To apply preferential Bayesian optimization in our setting, we would choose pairs of treatments $x$ and $x'$, evaluate our computational model at each to obtain $f(x)$ and $f(x')$, and obtain feedback from the decision-maker on which treatment is preferred. Using the results, the method then chooses a new pair of treatments at which to query the patient, to best support the goal of finding her preferred design. Critically, these methods do not attempt to learn utility as a function of the attributes $f(x)$, but instead learn it directly as a function of the design $x$. For this reason, these direct methods tend to require many queries of the decision-maker (wirth2017efficient; pinsler2018sample). Our approach leverages attribute observations to be more query-efficient.
Our work is also related to a line of research on adaptive utility elicitation (chajewska1998utility; chajewska2000making; boutilier2002pomdp; boutilier2006constraint). Unlike classical utility elicitation, which has accurate estimation as its final goal, this work elicits the decision-maker's utility function with the final goal of finding a good decision, even if this leaves residual uncertainty about the utility function (braziunas2006computational). However, this work assumes that attributes are inexpensive to evaluate and that the set of designs is discrete and finite, preventing its use in our setting.
Our work builds on Bayesian optimization (brochu2010tutorial; review), a framework for optimization of expensive-to-evaluate black-box functions. Our proposed sampling policy is a natural generalization of the classical expected improvement sampling policy in standard Bayesian optimization. It also generalizes the expected improvement for composite functions (astudillo2019bayesian), which can be obtained as a special case when $U$ is known.
Finally, our work is also related to frazier2011guessing, which pursued a similar preference-uncertainty approach for the pure-exploration multi-armed bandit problem with multiple attributes and linear utility functions, and without iterative interaction with the decision-maker.
5 Experiments
We compare the performance of our sampling policies (uEI and uTS) against the policy that chooses the points to sample at random (Random) and, when $U(\cdot; \theta)$ is increasing with respect to each attribute, against ParEGO (knowles2006), a popular algorithm for multi-objective BayesOpt.
In all problems, an initial stage of evaluations was performed using points chosen uniformly at random over $\mathbb{X}$. A second stage (pictured in plots) was then performed using the given sampling method. For all algorithms, the outputs of $f$ were modeled using independent GP prior distributions. All GP models involved in our experiments have a constant mean function and an ARD Matérn covariance function; the associated hyperparameters are estimated under a Bayesian approach. As proposed in snoek2012practical, for all algorithms we use an averaged version of the acquisition function, obtained by first drawing 10 samples of the GP hyperparameters, computing the acquisition function conditioned on each of these hyperparameters, and then averaging the results.
In all problems and for each replication, we draw one sample from the prior distribution over $\theta$ to obtain a true underlying utility function, which is used to generate the preference information from the decision-maker. The performance of the algorithms is reported with respect to this true underlying utility function.
In the problems described in §5.1, §5.3, and §5.4, the decision-maker provides feedback after each evaluation of $f$ in the second stage. For simplicity, we assume in these experiments that preference feedback is free from noise. In the problem in §5.2, the decision-maker does not provide feedback, and instead we use our method with the prior distribution described there. Decision-makers have preferences simulated from the prior distribution.
5.1 GPGenerated Test Problems
The first two problems used functions generated at random from a multi-output GP distribution with independent outputs. Each component of $f$ was generated by sampling on a uniform grid from a GP distribution with fixed hyperparameters and then taking the resulting posterior mean as a proxy; the hyperparameters were not known to any of the algorithms. In the first problem, the utility function is linear, $U(y; \theta) = \theta^\top y$, and the prior on $\theta$ is uniform over the family of linear utility functions with positive coefficients. In the second problem, the utility function is quadratic, with a prior over its parameters that is uniform over a bounded set.
Results are shown on a logarithmic scale in Figures 1 and 2, where the horizontal axis indicates the number of samples following the initial stage. In the first test problem, uEI substantially outperforms Random and ParEGO, and performs almost identically to uTS. In the second test problem, uEI substantially outperforms Random and Naive, which perform almost identically; here we do not compare against ParEGO because the utility function is not increasing.
5.2 Optimization of Multiple Metrics Where Only One Will Be Considered
As a third experiment, we consider a situation where the output of a simulator provides several metrics of interest to be maximized, but only one of them will be considered by the decision-maker and we do not know which one. This can be easily formulated within our framework by considering the family of utility functions $U_j(y) = y_j$, $j = 1, \dots, k$ (i.e., $U_j(y)$ is simply the $j$th coordinate of $y$) and setting a probability distribution over them, which reflects our belief about which metric is more likely to be considered by the decision-maker.
We test the ability of our algorithm to solve this type of problem using a test function with three outputs, where the first is based on the Ackley function (ackley), the second on the Levy function (levy), and the third on the Rosenbrock function (rosenbrock). Here we assume the distribution over the outputs is uniform. In contrast with all other experiments, here we do not collect additional information about the decision-maker's preferences. Results are shown on a logarithmic scale in Figure 3 (left). In this problem ParEGO performs surprisingly well; it outperforms uTS across all evaluations and outperforms uEI across evaluations 25–70. However, uEI achieves the best final solution quality.
5.3 Portfolio Simulation Optimization
In this test problem, we use our algorithm to tune the hyperparameters of a trading strategy so as to maximize the return of a decision-maker with an unknown risk tolerance. We envision this as modeling a financial advisor who has many clients, each of whom requires customized financial planning based on their own portfolio and has a different risk tolerance. Using choices made by past clients about which financial product they prefer, the financial advisor may form a probability distribution over utility functions to use when running a computationally expensive simulation to develop a menu of options to show a new client.
We use CVXPortfolio (cvxportfolio) to simulate and optimize the evolution of a portfolio over a period of four years, from Jan. 2012 through Dec. 2015, using open-source market data; the details of the simulation can be found in §7.1 of cvxportfolio. Here, $f$ has two outputs, the mean and standard deviation of the daily returns. We use a nonstandard utility function that sets $U(y; \theta)$ to the mean daily return if the standard deviation does not exceed $\theta$, and to $-\infty$ otherwise. This recovers the constrained optimization problem that maximizes the mean daily return subject to the constraint that the standard deviation is at most $\theta$. Analogously to the case of linear utility functions, discussed in Proposition 2, it can be shown that for this class of utility functions uEI admits an expression similar to that of the constrained expected improvement (gardner14).
Thus, in this setting we wish to maximize average return subject to an unknown constraint on the decision-maker's risk tolerance level $\theta$, on which we place a uniform prior. The hyperparameters to be tuned are the trade, hold, and risk-aversion parameters. Results are shown in Figure 4. Here, the optimal solution is unknown, so we report the utility value instead. As before, uEI substantially outperforms Random and ParEGO.
5.4 Optimization of Ambulance Bases
In this test problem, we optimize the locations of three ambulance bases according to the distribution of response times. We consider 4 attributes, representing the fractions of response times falling within each of four intervals, and assume a decision-maker considers these attributes to choose the ideal locations of the ambulance bases. Because these attributes are necessarily between 0 and 1, we model their logits instead of the attributes directly. We then use a utility function that is linear in the logits,
which corresponds to a linear utility function over the original attributes. Here, the prior on $\theta$ is taken to be uniform. Results are shown in Figure 5.
6 Conclusion
We introduced a novel framework for supporting decision-making processes based on expensive physical or computational experiments when there is uncertainty about the decision-maker's preferences. Our approach aims to be more robust to this uncertainty, and our proposed algorithm is able to leverage prior information on the decision-maker's preferences to improve sampling efficiency.
References
Appendix A Unbiased Estimator of the Gradient of uEI
In this section we formally state and prove Proposition 1.
Proposition 1.
Suppose that $U(\cdot; \theta)$ is differentiable for all $\theta$, and let $\mathbb{X}_0$ be an open subset of $\mathbb{X}$ such that $\mu_n$ and $C_n$ are differentiable on $\mathbb{X}_0$ and there exists a measurable function $g$ satisfying

1. $\|\gamma_n(x, \theta, z)\| \leq g(\theta, z)$ for all $x \in \mathbb{X}_0$, $\theta$, and $z$;

2. $\mathbb{E}[g(\theta, Z)] < \infty$, where $Z$ is a $k$-variate standard normal random vector independent of $\theta$, and the expectation is over both $\theta$ and $Z$.

Further, suppose that for almost every $\theta$ and $z$ the set $\{x \in \mathbb{X}_0 : U(\mu_n(x) + C_n(x) z; \theta) = U_n^*(\theta)\}$ is countable. Then, uEI is differentiable almost everywhere and its gradient, when it exists, is given by
$\nabla \mathrm{uEI}_n(x) = \mathbb{E}[\gamma_n(x, \theta, Z)],$
where the expectation is over $\theta$ and $Z$, and
$\gamma_n(x, \theta, Z) = \nabla \{U(\mu_n(x) + C_n(x) Z; \theta) - U_n^*(\theta)\}^+.$
Proof.
From the given hypotheses it follows that, for any fixed $\theta$ and $z$, the function $x \mapsto U(\mu_n(x) + C_n(x) z; \theta)$ is differentiable on $\mathbb{X}_0$. This in turn implies that the function $x \mapsto \{U(\mu_n(x) + C_n(x) z; \theta) - U_n^*(\theta)\}^+$ is continuous on $\mathbb{X}_0$ and differentiable at every $x$ such that $U(\mu_n(x) + C_n(x) z; \theta) \neq U_n^*(\theta)$, with gradient equal to $\gamma_n(x, \theta, z)$. From our assumption that for almost every $\theta$ and $z$ the set of such exceptional points is countable, it follows that for almost every $\theta$ and $z$ the function is continuous on $\mathbb{X}_0$ and differentiable on all of $\mathbb{X}_0$, except maybe on a countable subset. Using this, along with conditions 1 and 2, and Theorem 1 in l1990unified, the desired result follows. ∎
We note that, if one imposes the stronger condition $\mathbb{E}[g(\theta, Z)^2] < \infty$, then $\gamma_n$ has finite second moment, and thus this unbiased estimator of $\nabla \mathrm{uEI}_n(x)$ can be used within stochastic gradient ascent to find a stationary point of uEI (bottou1998online).
Appendix B Computation of uEI and Its Gradient When $U$ Is Linear
In this section we formally state and prove Propositions 2 and 3.
Proposition 2.
Suppose that $U(y; \theta) = \theta^\top y$ for all $y$ and $\theta$. Then,
$\mathrm{uEI}_n(x) = \mathbb{E}\bigl[\Delta_n(x, \theta)\,\Phi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr) + \sigma_n(x, \theta)\,\varphi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr)\bigr],$
where the expectation is over $\theta$, $\Delta_n(x, \theta) = \theta^\top \mu_n(x) - U_n^*(\theta)$, $\sigma_n(x, \theta)^2 = \theta^\top K_n(x, x)\,\theta$, and $\varphi$ and $\Phi$ are the standard normal probability density function and cumulative distribution function, respectively.
Proof.
Note that
$\mathrm{uEI}_n(x) = \mathbb{E}\bigl[\mathbb{E}\bigl[\{\theta^\top f(x) - U_n^*(\theta)\}^+ \mid \theta\bigr]\bigr].$
Thus, it suffices to show that
$\mathbb{E}\bigl[\{\theta^\top f(x) - U_n^*(\theta)\}^+ \mid \theta\bigr] = \Delta_n(x, \theta)\,\Phi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr) + \sigma_n(x, \theta)\,\varphi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr),$
but this can be easily verified by noting that, conditioned on $\theta$, the time-$n$ posterior distribution of $\theta^\top f(x)$ is normal with mean $\theta^\top \mu_n(x)$ and variance $\theta^\top K_n(x, x)\,\theta$. ∎
Proposition 3.
Suppose that $U(y; \theta) = \theta^\top y$ for all $y$ and $\theta$, that $\mu_n$ and $K_n$ are differentiable, and that there exists a function $g$ satisfying

1. $\bigl\|\nabla \bigl\{\Delta_n(x, \theta)\,\Phi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr) + \sigma_n(x, \theta)\,\varphi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr)\bigr\}\bigr\| \leq g(\theta)$ for all $x$ and $\theta$;

2. $\mathbb{E}[g(\theta)] < \infty$.

Then, uEI is differentiable and its gradient is given by
$\nabla \mathrm{uEI}_n(x) = \mathbb{E}\bigl[\nabla \bigl\{\Delta_n(x, \theta)\,\Phi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr) + \sigma_n(x, \theta)\,\varphi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr)\bigr\}\bigr].$
Proof.
Recall that
$\mathrm{uEI}_n(x) = \mathbb{E}\bigl[\Delta_n(x, \theta)\,\Phi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr) + \sigma_n(x, \theta)\,\varphi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr)\bigr].$
Moreover, standard calculations show that the integrand is a differentiable function of $x$, with gradient expressible in terms of $\nabla \Delta_n(x, \theta)$ and $\nabla \sigma_n(x, \theta)$. From conditions 1 and 2, and Theorem 16.8 in billingsley1995probability, it follows that uEI is differentiable and its gradient is obtained by exchanging the gradient and the expectation; i.e.,
$\nabla \mathrm{uEI}_n(x) = \mathbb{E}\bigl[\nabla \bigl\{\Delta_n(x, \theta)\,\Phi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr) + \sigma_n(x, \theta)\,\varphi\bigl(\Delta_n(x, \theta)/\sigma_n(x, \theta)\bigr)\bigr\}\bigr].$
∎
We end by noting that if $\Theta$ is compact and $\mu_n$ and $K_n$ are both continuously differentiable, then the norm of the gradient appearing in condition 1 is continuous in $(x, \theta)$ and thus attains its maximum value on $\mathbb{X} \times \Theta$ (recall that $\mathbb{X}$ is compact as well). Thus, in this case conditions 1 and 2 are satisfied by the constant function $g$ equal to this maximum value.
Appendix C Exploration and Exploitation Trade-Off
Proposition 4.
Suppose that, for every $\theta$, $U(\cdot; \theta)$ is convex and nondecreasing in each coordinate. Also suppose $x, x' \in \mathbb{X}$ are such that $\mu_n(x) \geq \mu_n(x')$ and $K_n(x, x) \succeq K_n(x', x')$, where the first inequality is coordinate-wise and $\succeq$ denotes the partial order defined by the cone of positive semidefinite matrices. Then, $\mathrm{uEI}_n(x) \geq \mathrm{uEI}_n(x')$.
Proof.
Since $K_n(x, x) \succeq K_n(x', x')$, we have that $f(x) \overset{d}{=} \mu_n(x) + C_n(x') Z + W$, where $W$ is a $k$-variate normal random vector with zero mean and covariance matrix $K_n(x, x) - K_n(x', x')$, independent of $Z$. Thus, conditioned on $\theta$,
$\mathbb{E}\bigl[\{U(\mu_n(x') + C_n(x') Z; \theta) - U_n^*(\theta)\}^+\bigr] \leq \mathbb{E}\bigl[\{U(\mu_n(x) + C_n(x') Z; \theta) - U_n^*(\theta)\}^+\bigr] \leq \mathbb{E}\bigl[\{U(\mu_n(x) + C_n(x') Z + W; \theta) - U_n^*(\theta)\}^+\bigr],$
where the first and second inequalities follow from the fact that the function $y \mapsto \{U(y; \theta) - U_n^*(\theta)\}^+$ is increasing and convex, respectively, along with Jensen's inequality. Finally, taking expectations with respect to $\theta$ yields the desired result. ∎