###### Abstract

We consider black-box global optimization of time-consuming-to-evaluate functions on behalf of a decision-maker whose preferences must be learned. Each feasible design is associated with a time-consuming-to-evaluate vector of attributes, each vector of attributes is assigned a utility by the decision-maker’s utility function, and this utility function may be learned approximately using preferences expressed by the decision-maker over pairs of attribute vectors. Past work has used this estimated utility function as if it were error-free within single-objective optimization. However, errors in utility estimation may yield a poor suggested decision. Furthermore, this approach produces a single suggested “best” design, whereas decision-makers often prefer to choose among a menu of designs. We propose a novel Bayesian optimization algorithm that acknowledges the uncertainty in preference estimation and implicitly chooses designs to evaluate using the time-consuming function that are good not just for a single estimated utility function but a range of likely utility functions. Our algorithm then shows a menu of designs and evaluated attributes to the decision-maker who makes a final selection. We demonstrate the value of our algorithm in a variety of numerical experiments.

# Bayesian optimization with uncertain preferences over attributes

Raul Astudillo (Cornell University) and Peter I. Frazier (Cornell University, Uber)

## 1 Introduction

We begin with a motivating example: helping a cancer patient (the “decision-maker”) find the best treatment. Cancer treatments exhibit a range of abilities to cure disease, side effects, and financial costs (aning2012patient; wong2013cancer; marshall2016women), referred to here as “attributes”. Suppose a patient considers $k$ real-valued attributes when selecting a cancer treatment. Also suppose a time-consuming-to-evaluate black-box computational simulator can use the patient’s medical history to compute the attributes, $f(x)$, of treatment $x$. The patient has an implicit preference over these attributes, and our goal is to help her find her most preferred treatment by querying our simulator.

One existing approach, pursued within preference-based reinforcement learning (wirth2017survey), is to first learn the patient’s preferences (chu2005preference; dewancker2016; abbas2018foundations) and then optimize using the learned estimates. We call this approach “learn then optimize”. This approach asks the patient for her preference between attribute vectors $f(x_1)$ and $f(x_2)$, corresponding to pairs of treatments $x_1$, $x_2$. It then learns a utility function $U$, e.g., using preference learning with Gaussian processes (chu2005preference), such that the judgements are as consistent as possible with the estimated utility differences $U(f(x_1)) - U(f(x_2))$. It then solves $\max_x U(f(x))$ using a method for optimizing time-consuming-to-evaluate black-box functions, such as Bayesian optimization (BayesOpt) (frazier2018tutorial), assuming that the estimated utility function is correct. Optionally, if more judgements become available during optimization, these can be used to update the estimate (wirth2017survey). This approach, however, is not robust to uncertainty in preference estimates.

To illustrate how becoming robust to uncertainty in preferences can improve performance, suppose that preference learning suggests that the patient’s true utility function is close to one of $m$ possible functions $U_1,\dots,U_m$. Then, a better approach would be to offer the patient a set of treatments $x^*_1,\dots,x^*_m$, where $x^*_j \in \operatorname{arg\,max}_x U_j(f(x))$, and let her choose among them. This will provide near-optimal utility to the patient, while optimizing for a single point estimate of the utility function will not. While this approach improves over the standard approach in the utility it provides, it requires solving $m$ optimization problems with an expensive-to-evaluate objective, which becomes computationally infeasible as $m$ grows. Our approach (described below) delivers similar utility gains using fewer queries to the objective function.

Another approach, which can be used when each attribute is a quantity that the patient wants to be as large (or small) as possible, is to use multi-objective Bayesian optimization (abdolshah2019multi; knowles2006) to estimate the Pareto frontier. This approach, however, does not use interaction with the patient to focus optimization on the parts of the Pareto frontier most likely to contain the patient’s preferred solution. Intuitively, such information could accelerate optimization, especially when moderate or large numbers of attributes create high-dimensional Pareto frontiers and lead to a large number of Pareto-optimal solutions.

Motivated by these shortcomings of existing approaches, we propose a novel Bayesian optimization approach that leverages learned preferences to solve the problem described above. By modeling uncertainty in the utility function, it improves the utility of the solution delivered over the “learn-then-optimize approach”. By leveraging judgments over attributes from the decision-maker, it uses fewer objective function queries than multi-objective approaches.

Our approach uses preference learning and pairwise judgements from the decision-maker to infer a Bayesian posterior distribution over the decision-maker’s utility function. Within a Bayesian optimization framework, it models the objective $f$ using a multi-output Gaussian process and uses one of two novel acquisition functions, the expected improvement under utility uncertainty (u-EI) or Thompson sampling under utility uncertainty (u-TS), to iteratively choose designs at which to evaluate $f$. Optionally, during optimization, decision-maker judgements on the evaluated designs may be incorporated into our posterior distribution on the utility function. At the conclusion of optimization, a menu of designs and evaluated attributes is shown to the decision-maker, who makes a final selection.

Our primary contribution is this pair of novel acquisition functions, u-EI and u-TS, which generalize existing Bayesian optimization acquisition functions to the utility uncertainty setting. We also provide an efficient simulation-based estimator of the gradient of u-EI, which can be made more efficient still in the important special case of linear utility functions, and use these estimates within multi-start stochastic gradient ascent to efficiently maximize u-EI.

Our approach fills an important gap between today’s single-objective optimization approaches, which assume perfect knowledge of the decision-maker’s preferences, and multi-objective optimization approaches, which do not provide a principled way to accommodate partial information about preferences.

We first formalize our problem setting in §2 and define our acquisition functions in §3. §4 reviews other related work, §5 presents numerical experiments, and §6 concludes.

## 2 Problem Setting

We now formally describe our problem setting.

### 2.1 Designs and Attributes

We assume that both designs and attributes can be represented as vectors. More concretely, we assume that the space of designs can be represented as a compact set $\mathcal{X} \subset \mathbb{R}^d$, and that attributes are given by a derivative-free, expensive-to-evaluate, black-box continuous function $f : \mathcal{X} \to \mathbb{R}^k$. As is common in BayesOpt, we assume that the dimension $d$ is not too large and that projections onto $\mathcal{X}$ can be efficiently computed.

### 2.2 Decision-Maker’s Preferences

We assume that there is a decision-maker whose preference over designs is characterized by the designs’ attributes through a Von Neumann-Morgenstern utility function (vonNeuman), $U : \mathbb{R}^k \to \mathbb{R}$. This implies that the decision-maker (strictly) prefers a design $x$ over $x'$ if and only if $U(f(x)) > U(f(x'))$. Thus, of all the designs, the decision-maker most prefers one in the set $\operatorname{arg\,max}_{x \in \mathcal{X}} U(f(x))$. As is standard in preference learning (furnkranz2010preference), we assume that the decision-maker can provide ordinal preferences between two designs $x$ and $x'$ when shown previously evaluated attribute vectors $f(x)$ and $f(x')$.

### 2.3 Interaction with the Decision-Maker and Computational Model

In our approach, an algorithm interacts sequentially with a human decision-maker and a time-consuming-to-evaluate objective function $f$ (typically a computer model). The algorithm interacts with the computational model simply by selecting a design $x$ and evaluating $f(x)$. We let $x_n$ indicate the $n$-th point at which we evaluate $f$. As is standard in Bayesian optimization, the first set of computational model evaluations is chosen uniformly at random from the feasible domain, after which evaluations are guided by an acquisition function described below in §3.

The algorithm interacts with the decision-maker by receiving ordinal preferences between pairs of attribute vectors. We index interactions with the decision-maker by $m$, letting $y_m$ and $y'_m$ refer to the attribute vectors queried in the $m$-th interaction, and $a_m \in \{-1, 1\}$ indicate the decision-maker’s response, where $a_m = 1$ indicates a preference for $y_m$ and $a_m = -1$ a preference for $y'_m$. We let $M(n)$ be the number of design pairs evaluated by the decision-maker by the completion of the $n$-th run of the computational model. We envision that $y_m$ and $y'_m$ would typically be the attribute vectors for previously evaluated designs, $f(x_i)$ and $f(x_j)$, with $i, j \le n$.

For concreteness, our numerical experiments assume that, before each evaluation of , the decision-maker provides feedback on one pair of designs chosen uniformly at random from among those previously evaluated. Our framework easily supports other patterns of interaction. For example, it supports a setting where the decision-maker provides feedback in a single batch after the first-stage evaluations of the computational model are complete, either over random previously evaluated attribute vectors or using a more sophisticated and query-efficient selection of attribute vectors (lepird2015bayesian). It also supports a setting in which the decision-maker provides feedback at a random series of time points on pairs of previously evaluated attribute vectors of their choosing.

### 2.4 Statistical Model Over f

As is standard in BayesOpt (review), we place a (multi-output) Gaussian process (GP) prior on $f$ (alvarez2012kernels), characterized by a mean function $\mu_0 : \mathcal{X} \to \mathbb{R}^k$ and a positive definite covariance function $K_0 : \mathcal{X} \times \mathcal{X} \to \mathbb{S}^k_{++}$, where $\mathbb{S}^k_{++}$ denotes the cone of positive definite $k \times k$ matrices. Thus, after observing $n$ noise-free evaluations of $f$ at points $x_1,\dots,x_n$, the estimates of the designs’ attributes are given by the posterior distribution on $f$, which is again a multi-output GP, characterized by a mean function $\mu_n$ and covariance function $K_n$ that can be computed in closed form in terms of $\mu_0$ and $K_0$ (liu2018remarks).
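For intuition, the following is a minimal numpy sketch of the noise-free posterior update for a single output; stacking $k$ independent such models gives a simple multi-output GP. The RBF kernel, zero prior mean, and length-scale value here are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def rbf(a, b, ls=0.5):
    """Squared-exponential kernel matrix between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, Y, Xs, ls=0.5, jitter=1e-10):
    """Noise-free GP posterior mean and covariance at test points Xs,
    assuming a zero prior mean (mu_0 = 0) and an RBF kernel."""
    K = rbf(X, X, ls) + jitter * np.eye(len(X))  # K_0(X, X), jittered for stability
    Ks = rbf(X, Xs, ls)                          # K_0(X, Xs)
    A = np.linalg.solve(K, Ks)                   # K_0(X, X)^{-1} K_0(X, Xs)
    mu_n = A.T @ Y                               # posterior mean mu_n(Xs)
    K_n = rbf(Xs, Xs, ls) - Ks.T @ A             # posterior covariance K_n(Xs)
    return mu_n, K_n
```

Because observations are noise-free, the posterior mean interpolates the data: at the evaluated points, `mu_n` returns the observed values and `K_n` collapses to (numerically) zero.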

### 2.5 Statistical Model Over U

We use Bayesian preference learning (chu2005preference; lepird2015bayesian) to infer a posterior probability distribution over the utility function given preferences expressed by the decision-maker. Although this method is standard in the literature, we describe it here for completeness.

We use a parametric family of utility functions $\{U(\cdot;\theta) : \theta \in \Theta\}$ (following, for example, akrour2014programming; wirth2016model); a prior probability distribution $p_\theta$ over $\theta$; and a likelihood function $L$ giving the conditional probability of the decision-maker expressing preference $a$ in response to an offered pair of attribute vectors $y$, $y'$ with utility difference $U(y;\theta) - U(y';\theta)$. The posterior distribution over $\theta$ after feedback on $m$ pairwise comparisons, written $p_{\theta,m}$, is then given by Bayes’ rule:

$$p_{\theta,m}(\theta) \propto p_\theta(\theta) \prod_{\ell=1}^{m} L\bigl(a_\ell;\, U(y_\ell;\theta) - U(y'_\ell;\theta)\bigr).$$

In our approach, we rely only on the ability to sample from this posterior distribution.

The most widely used parametric family of utility functions is linear functions (wirth2017survey), with other examples including linear functions over kernel-based feature spaces (wirth2016model; kupcsik2018learning) and deep neural networks (christiano2017deep). Commonly used likelihood functions include probit and logit (wirth2017survey). In our numerical experiments, for simplicity, we assume fully accurate preference responses, with parametric families and priors described below. Although we assume parametric utility functions, conceptually, our approach generalizes to handle nonparametric Bayesian preference learning with noisy judgements (chu2005preference). However, this poses additional computational challenges, as our approach internally performs optimization of samples of the utility function, which can be slow for nonparametric models.
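As a concrete sketch of the Bayes update above, the snippet below computes the posterior over a finite grid of candidate parameters for a linear utility with a probit likelihood. The grid discretization and probit choice are illustrative assumptions; the paper's experiments assume noise-free responses.

```python
import math
import numpy as np

def probit_lik(a, delta):
    """L(a; delta): probability of response a in {-1, +1} given utility difference delta."""
    return 0.5 * (1.0 + math.erf(a * delta / math.sqrt(2.0)))

def posterior_over_theta(thetas, comparisons):
    """Bayes update over a finite grid of candidate thetas (uniform prior).

    thetas      : (m, k) array, each row a candidate utility parameter
    comparisons : list of (y, y_prime, a) with a = +1 if y preferred, else -1
    Returns normalized posterior probabilities over the grid.
    """
    logp = np.zeros(len(thetas))
    for y, yp, a in comparisons:
        delta = thetas @ (np.asarray(y) - np.asarray(yp))  # linear utility differences
        logp += np.log([probit_lik(a, d) for d in delta])
    w = np.exp(logp - logp.max())                          # stabilize before normalizing
    return w / w.sum()
```

Samples from this posterior, as required by our approach, can then be drawn with `np.random.default_rng().choice(len(thetas), p=post)`.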

### 2.6 Measure of Performance

We suppose that, after $N$ evaluations of the computational model (and the associated judgements on attribute vector pairs), the decision-maker selects her most preferred design among all evaluated designs. Thus, the utility generated, given $\theta$, is

$$\max_{i=1,\dots,N} U(f(x_i);\theta), \qquad (1)$$

and we wish to adaptively choose the designs to evaluate, $x_1,\dots,x_N$, so as to maximize the expected value of (1), where the expectation is taken over the prior on $\theta$ and the randomness in $x_1,\dots,x_N$ (induced by the random first stage of samples and randomness in the decision-maker’s responses).

## 3 Acquisition Functions

Here we propose two novel acquisition functions, the expected improvement under utility uncertainty (u-EI) and Thompson sampling under utility uncertainty (u-TS), for selecting points at which to query $f$. The bulk of our development and analysis focuses on u-EI, since it is the more difficult of the two to optimize and performs better in our numerical experiments.

### 3.1 Expected Improvement Under Utility Uncertainty (u-EI)

Expected improvement is arguably the most popular acquisition function in BayesOpt. It has been successfully generalized for constrained (pmlr-v32-gardner14) and multi-objective optimization (emmerich2006single), and we next show that it can be naturally generalized to our setting as well by extending expected improvement’s one-step optimality analysis (jones1998efficient; frazier2018tutorial) to the setting with utility uncertainty.

After evaluating designs $x_1,\dots,x_n$, the utility obtained by the decision-maker when she selects her most preferred design among this set is

$$U^*_n(f;\theta) := \max_{i=1,\dots,n} U(f(x_i);\theta).$$

On the other hand, if we evaluate one more design, $x$, the utility obtained by the decision-maker increases by

$$\max\{U(f(x);\theta),\, U^*_n(f;\theta)\} - U^*_n(f;\theta) = \{U(f(x);\theta) - U^*_n(f;\theta)\}^+.$$

This difference measures the improvement from sampling $x$. Thus, a natural sampling policy is to evaluate the design that maximizes the expected improvement

$$\text{u-EI}_n(x) := \mathbb{E}_n\left[\{U(f(x);\theta) - U^*_n(f;\theta)\}^+\right], \qquad (2)$$

where the expectation is over both $f$ and $\theta$, and the subscript $n$ indicates that the expectation is computed with respect to their posterior distributions given the computational evaluations and decision-maker responses observed so far.

We call u-EI the expected improvement under utility uncertainty and refer to the above policy as the u-EI policy. By construction, this sampling policy is one-step Bayes optimal.

#### 3.1.1 Computation and Maximization of u-EI

In contrast with the standard expected improvement, u-EI cannot be computed in closed form. However, as we show next, it can still be efficiently maximized.

First, we introduce some notation. Making a slight abuse of notation, we write $K_n(x) = K_n(x,x)$, and we let $C_n(x)$ be the lower Cholesky factor of $K_n(x)$.

We note that, for any fixed $\theta$, the time-$n$ posterior distribution of $f(x)$ is normal with mean $\mu_n(x)$ and covariance matrix $K_n(x)$. Therefore, we can express $f(x) = \mu_n(x) + C_n(x)Z$, where $Z$ is a $k$-variate standard normal random vector, and thus

$$\text{u-EI}_n(x) = \mathbb{E}_n\left[\{U(\mu_n(x) + C_n(x)Z;\theta) - U^*_n(f;\theta)\}^+\right].$$

This implies that we can compute u-EI$_n(x)$ via Monte Carlo, as summarized in Algorithm 1.
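A sketch of this Monte Carlo estimator, following the expression above, is given below. The function and argument names are ours; `theta_samples`, `u_star`, and `utility` stand in for posterior samples of $\theta$, $U^*_n(f;\theta)$, and $U(\cdot;\theta)$ respectively.

```python
import numpy as np

def u_ei_mc(mu, C, theta_samples, u_star, utility, n_z=2000, seed=0):
    """Monte Carlo estimate of u-EI_n(x) at a single design x.

    mu            : (k,) posterior mean mu_n(x)
    C             : (k, k) lower Cholesky factor C_n(x)
    theta_samples : sequence of draws from the posterior over theta
    u_star        : callable theta -> U*_n(f; theta), best utility so far
    utility       : callable (y, theta) -> U(y; theta)
    """
    rng = np.random.default_rng(seed)
    est = 0.0
    for theta in theta_samples:
        Z = rng.standard_normal((n_z, len(mu)))
        Y = mu + Z @ C.T                          # draws of f(x) given the data
        u = np.array([utility(y, theta) for y in Y])
        est += np.maximum(u - u_star(theta), 0.0).mean()
    return est / len(theta_samples)
```

With a standard-normal posterior on a single attribute, linear utility $\theta = 1$, and incumbent utility 0, this recovers the classical value $\mathbb{E}[\max(Z,0)] = 1/\sqrt{2\pi} \approx 0.399$.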

In principle, the above is enough to maximize u-EI using a derivative-free global optimization algorithm (for non-expensive functions). However, we can optimize u-EI more efficiently by leveraging derivative information, which the following proposition provides.

###### Proposition 1.

Under mild regularity conditions, u-EI$_n$ is differentiable almost everywhere, and its gradient, when it exists, is given by

$$\nabla\,\text{u-EI}_n(x) = \mathbb{E}_n\left[\gamma_n(x, Z; \theta)\right],$$

where the expectation is over $Z$ and $\theta$,

$$\gamma_n(x, Z; \theta) = \begin{cases} 0, & \text{if } U(\mu_n(x) + C_n(x)Z;\theta) \le U^*_n(f;\theta), \\ \nabla U(\mu_n(x) + C_n(x)Z;\theta), & \text{otherwise,} \end{cases}$$

and the gradient is taken with respect to $x$.

Thus, $\gamma_n(x, Z; \theta)$ provides an unbiased estimator of $\nabla\,\text{u-EI}_n(x)$, which can be used within a gradient-based stochastic optimization algorithm, such as stochastic gradient ascent, to find stationary points of u-EI. We may then start stochastic gradient ascent from multiple starting points, use simulation to evaluate u-EI at each resulting solution, and select the best. By increasing the number of starting points, we may find a high-quality local optimum and, asymptotically, a global optimum.
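The following toy illustrates how these unbiased gradient samples drive stochastic gradient ascent. The one-dimensional design space, the assumed forms of the posterior mean and its derivative, the constant posterior standard deviation (so its derivative vanishes), and the fixed $\theta$ are all hypothetical stand-ins, not the paper's setup.

```python
import numpy as np

def sga_uei_1d(x0, mu, dmu, c, u_star, theta=1.0, steps=500, lr=0.05, seed=0):
    """Toy stochastic gradient ascent on u-EI for k = 1 and linear U(y; theta) = theta * y.

    mu, dmu : assumed posterior mean mu_n(x) and its derivative
    c       : constant posterior standard deviation C_n(x)
    """
    rng = np.random.default_rng(seed)
    x = x0
    for t in range(1, steps + 1):
        z = rng.standard_normal()
        y = mu(x) + c * z                                     # draw f(x) | data
        grad = theta * dmu(x) if theta * y > u_star else 0.0  # gamma_n(x, Z; theta)
        x = min(max(x + (lr / np.sqrt(t)) * grad, 0.0), 1.0)  # ascend, project onto [0, 1]
    return x
```

With a posterior mean peaked at $x = 0.7$ and an incumbent utility below that peak, the iterates drift toward the peak, as a single run starting at $x = 0.2$ shows.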

A formal statement and proof of Proposition 1, along with the proofs of all other theoretical results, can be found in the supplementary material.

#### 3.1.2 Computation of u-EI and Its Gradient When U Is Linear

While the above approach can be used to efficiently maximize u-EI for general utility functions, maximization can be made even more efficient for linear utility functions, the most widely used class in practice.

###### Proposition 2.

Suppose that $\Theta \subset \mathbb{R}^k$ and $U(y;\theta) = \theta^\top y$ for all $y \in \mathbb{R}^k$ and $\theta \in \Theta$. Then,

$$\text{u-EI}_n(x) = \mathbb{E}_n\left[\Delta_n(x;\theta)\,\Phi(\zeta) + \sigma_n(x;\theta)\,\varphi(\zeta)\right],$$

where the expectation is over $\theta$, $\Delta_n(x;\theta) = \theta^\top \mu_n(x) - U^*_n(f;\theta)$, $\sigma_n(x;\theta) = \sqrt{\theta^\top K_n(x)\theta}$, $\zeta = \Delta_n(x;\theta)/\sigma_n(x;\theta)$, and $\varphi$ and $\Phi$ are the standard normal density function and cumulative distribution function, respectively.

The result above shows that, when each $U(\cdot;\theta)$ is linear, the computation of u-EI essentially reduces to that of the standard expected improvement, modulo integrating over the uncertainty in $\theta$. In particular, in this case the uncertainty with respect to $Z$ can be integrated out. Moreover, one can derive a result analogous to Proposition 1 in which the explicit dependence on $Z$ is eliminated as well.
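A sketch consistent with Proposition 2: the per-$\theta$ closed form is evaluated exactly, and only the outer expectation over $\theta$ is approximated by averaging over posterior samples. Here `mu` and `K` stand for $\mu_n(x)$ and $K_n(x)$, and `u_star` for $U^*_n(f;\theta)$.

```python
import math
import numpy as np

def u_ei_linear(mu, K, theta_samples, u_star):
    """u-EI_n(x) for linear utilities: closed form per theta, averaged over theta samples."""
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal cdf
    vals = []
    for th in theta_samples:
        delta = float(th @ mu) - u_star(th)        # Delta_n(x; theta)
        sigma = math.sqrt(float(th @ K @ th))      # sigma_n(x; theta)
        zeta = delta / sigma
        vals.append(delta * Phi(zeta) + sigma * phi(zeta))
    return float(np.mean(vals))
```

Compared with the Monte Carlo estimator over $Z$, this version has no sampling noise from $Z$, only from $\theta$.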

###### Proposition 3.

Suppose that $U(y;\theta) = \theta^\top y$ for all $y$ and $\theta$. Then, under mild regularity conditions, u-EI$_n$ is differentiable almost everywhere, and its gradient, when it exists, is given by

$$\mathbb{E}_n\!\left[\Phi(\zeta)\,\nabla\bigl(\theta^\top \mu_n(x)\bigr) + \frac{\varphi(\zeta)}{2\,\sigma_n(x;\theta)} \sum_{i,j=1}^{k} \theta_i \theta_j\, \nabla K_n(x)_{i,j}\right].$$

Analogously to Proposition 1, Proposition 3 provides a method for efficiently computing an unbiased estimator of $\nabla\,\text{u-EI}_n(x)$. Moreover, it also implies that, if $\Theta$ is discrete and its cardinality is not too large, the gradient of u-EI can be computed exactly, allowing the use of faster non-stochastic optimization algorithms for maximizing u-EI.

### 3.2 Exploitation vs. Exploration Trade-Off

One of the key properties of the classical expected improvement acquisition function is that it is increasing with respect to both the posterior mean and the posterior variance. This means that it prefers to sample points that are either promising with respect to our current knowledge or still highly uncertain, a desirable property of any sampling policy aiming to balance exploitation and exploration. The following result shows that, under standard assumptions on $U$, the u-EI sampling policy satisfies an analogous property.

###### Proposition 4.

Suppose that, for every $\theta$, $U(\cdot;\theta)$ is convex and increasing in each coordinate. Also suppose $x, x' \in \mathcal{X}$ are such that $\mu_n(x) \ge \mu_n(x')$ and $K_n(x) \succeq K_n(x')$, where the first inequality holds coordinate-wise and $\succeq$ denotes the partial order defined by the cone of positive semi-definite matrices. Then,

$$\text{u-EI}_n(x) \ge \text{u-EI}_n(x').$$

### 3.3 Thompson Sampling under Utility Uncertainty (u-TS)

Thompson sampling under utility uncertainty (u-TS) generalizes the well-known Thompson sampling method (thompson1933likelihood) to the utility uncertainty setting.

It first samples $\theta$ from its posterior distribution. Then, it samples $f$ from its Gaussian process posterior distribution. To decide where to evaluate next, it optimizes $U(f(x);\theta)$ over $x$ using these sampled values and evaluates $f$ at the resulting maximizer.

This contrasts with the “learn then optimize” approach in that u-TS samples $\theta$ from its posterior rather than simply setting it equal to a point estimate. For example, if we implemented learn then optimize using standard Thompson sampling, we would sample only $f$ from its posterior and then optimize $U(f(x);\hat{\theta})$, where $\hat{\theta}$ is a point estimate, such as the maximum a posteriori estimate. u-TS can induce substantially more exploration than this more classical approach.

u-TS can be implemented by sampling $f$ over a grid of points if $\mathcal{X}$ is low-dimensional. It can also be implemented for higher-dimensional $\mathcal{X}$ by optimizing with a method for continuous nonlinear optimization (like CMA-ES, hansen2016cma), lazily sampling from the posterior on $f$ at each new point that CMA-ES wants to evaluate, conditioning on previous real and sampled evaluations. We use the latter approach in our computational experiments.
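The grid-based variant can be sketched as follows. For simplicity, this sketch treats the grid points as independent given the data; a full implementation would sample jointly from the GP posterior, and all names here are our own.

```python
import numpy as np

def u_ts_next(mu, std, theta_samples, utility, rng):
    """One u-TS step over a finite grid of candidate designs.

    mu, std       : (m, k) posterior means / standard deviations of f at the m
                    grid points (independence across points is a simplification)
    theta_samples : draws from the posterior over utility parameters
    utility       : callable (y, theta) -> U(y; theta)
    Returns the index of the grid point to evaluate next.
    """
    theta = theta_samples[rng.integers(len(theta_samples))]  # theta ~ posterior
    f_draw = mu + std * rng.standard_normal(mu.shape)        # f ~ (approx.) GP posterior
    utils = np.array([utility(y, theta) for y in f_draw])
    return int(np.argmax(utils))                             # maximizer of U(f(.); theta)
```

Randomness enters through both the sampled $\theta$ and the sampled $f$, which is what distinguishes u-TS from Thompson sampling under a point estimate of the utility.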

## 4 Related Work

The introduction discusses the two lines of most related work: the “learn then optimize” approach pursued within preference-based reinforcement learning (PbRL); and multi-objective Bayesian optimization.

The most closely related work in PbRL is utility-based PbRL using trajectory utilities (wirth2017survey). This variant of PbRL seeks to design a control policy that maximizes the utility of a human subject, using features computed from trajectories. Work in this area includes akrour2014programming and wirth2016model. Unlike our work, this line of work does not consider uncertainty in utility function estimates when performing optimization.

Multi-objective BayesOpt includes knowles2006; bautista2009; binois2015quantifying; hl16; shah2016pareto; feliot2017bayesian. Multi-objective optimization cannot easily incorporate prior information about the decision-maker’s preferences, though several attempts have been made, mostly through modified Pareto-dominance criteria or weighted-sum approaches (cvetkovic2002preferences; zitzler2004indicator; rachmawati2006preference). Most of this work is outside the BayesOpt framework, with only three exceptions known to us.

feliot2018user proposes a weighted version of the expected Pareto hypervolume improvement approach (emmerich2006single) to focus the search on certain regions of the Pareto front. However, no method is provided for choosing the weights from data, in contrast with our approach’s ability to learn from decision-maker interactions. Moreover, this method suffers from the same computational limitations as the standard expected Pareto hypervolume improvement approach, limiting its applicability to at most three objectives (hl16). abdolshah2019multi also proposes a weighted version of the expected Pareto hypervolume improvement approach, exploring the region of the Pareto frontier satisfying a preference-order constraint. Finally, paria2018flexible proposes an approach based on random scalarizations. In contrast with our approach, no method is available for choosing the distribution of these scalarizations from data.

Another related literature is preferential Bayesian optimization (gonzalez2017preferential). Within this line of work, kupcsik2018learning applies preference-based optimization to a parameterized control policy for robotic object handover, and brochu2010tutorial applies it to realistic material design in computer graphics. To apply preferential Bayesian optimization in our setting, we would choose pairs of treatments $x$ and $x'$, evaluate our computational model at each to obtain $f(x)$ and $f(x')$, and obtain feedback from the decision-maker on which treatment is preferred. Using the results, we would then choose a new pair of treatments at which to query the patient, to best support the goal of finding her preferred design. Critically, these methods do not attempt to learn utility as a function of the attributes $f(x)$, but instead learn it directly as a function of $x$. For this reason, these direct methods tend to require many queries of the decision-maker (wirth2017efficient; pinsler2018sample). Our approach leverages attribute observations to be more query-efficient.

Our work is also related to a line of research on adaptive utility elicitation (chajewska1998utility; chajewska2000making; boutilier2002pomdp; boutilier2006constraint). Unlike in classical utility elicitation, which has accurate estimation as its final goal, this work elicits the decision maker’s utility function with the final goal of finding a good decision, even if this leaves residual uncertainty about the utility function (braziunas2006computational). However, this work assumes that attributes are inexpensive to evaluate, and that the set of designs is discrete and finite, preventing its use in our setting.

Our work builds on Bayesian optimization (brochu2010tutorial; review), a framework for optimization of expensive-to-evaluate black-box functions. Our proposed sampling policy is a natural generalization of the classical expected improvement sampling policy in standard Bayesian optimization. It also generalizes the expected improvement for composite functions (astudillo2019bayesian), which can be obtained as a special case when $U$ is known.

Finally, our work is also related to frazier2011guessing, which pursued a similar preference-uncertainty approach for the pure-exploration multi-armed bandit problem with multiple attributes and linear utility functions, and without iterative interaction with the decision-maker.

## 5 Experiments

We compare the performance of our sampling policies (u-EI and u-TS) against the policy that chooses the points to sample at random (Random) and, when $U(\cdot;\theta)$ is increasing with respect to each attribute, against ParEGO (knowles2006), a popular algorithm for multi-objective BayesOpt.

In all problems, an initial stage of evaluations was performed using points chosen uniformly at random over $\mathcal{X}$. A second stage (pictured in the plots) was then performed using the given sampling method. For all algorithms, the outputs of $f$ were modeled using independent GP prior distributions. All GP models involved in our experiments have a constant mean function and an ARD Matérn covariance function; the associated hyperparameters are estimated under a Bayesian approach. As proposed in snoek2012practical, for all algorithms we use an averaged version of the acquisition function, obtained by first drawing 10 samples of the GP hyperparameters, computing the acquisition function conditioned on each of these hyperparameters, and then averaging the results.

In all problems and for each replication, we draw one sample from the prior distribution to obtain a true underlying utility function, which is used to obtain the preference information from the decision-maker. The performance of the algorithms is reported with respect to this true underlying utility function.

In the problems described in §5.1, §5.3, and §5.4, the decision-maker provides feedback after each evaluation of $f$ in the second stage. For simplicity, we assume in these experiments that preference feedback is free from noise. In the problem in §5.2, the decision-maker does not provide feedback, and instead we use our method with the prior distribution described there. Decision-makers have preferences simulated from the prior distribution.

### 5.1 GP-Generated Test Problems

The first two problems used functions generated at random from a multi-output GP distribution with independent outputs. Each component of $f$ was generated by sampling on a uniform grid from a GP distribution with fixed hyperparameters and then taking the resulting posterior mean as a proxy; the hyperparameters were not known to any of the algorithms. In the first problem, the utility function is linear, $U(y;\theta) = \theta^\top y$, and the prior on $\theta$ is uniform over the family of linear utility functions with positive coefficients. In the second problem, the utility function is quadratic in the attributes, with a uniform prior over its parameters.

Results are shown on a logarithmic scale in Figures 1 and 2, where the horizontal axis indicates the number of samples following the initial stage. In the first test problem, u-EI substantially outperforms Random and ParEGO, and performs almost identically to u-TS. In the second test problem, u-EI substantially outperforms Random and Naive, which perform almost identically; here we do not compare against ParEGO because the utility function is not increasing.

### 5.2 Optimization of Multiple Metrics Where Only One Will Be Considered

As a third experiment, we consider a situation where the output of a simulator provides several metrics of interest to be maximized, but only one of them will be considered by the decision-maker, and we do not know which one. This can be easily formulated within our framework by considering the family of utility functions $U_j(y) = y_j$, $j = 1,\dots,k$ (i.e., $U_j$ is simply the $j$-th coordinate of $y$), and setting a probability distribution over them that reflects our belief about which metric is more likely to be considered by the decision-maker.

We test the ability of our algorithm to solve this type of problem using a test function with three outputs, derived from the Ackley function (ackley), the Levy function (levy), and the Rosenbrock function (rosenbrock). Here we assume the distribution over the outputs is uniform. In contrast with all other experiments, here we do not collect additional information about the decision-maker’s preferences. Results are shown on a logarithmic scale in Figure 3 (left). In this problem ParEGO performs surprisingly well; it outperforms u-TS across all evaluations and outperforms u-EI across evaluations 25-70. However, u-EI achieves the best final solution quality.

### 5.3 Portfolio Simulation Optimization

In this test problem, we use our algorithm to tune the hyper-parameters of a trading strategy so as to maximize the return of a decision-maker with an unknown risk tolerance. We envision this as modeling a financial advisor who has many clients, each of whom requires customized financial planning based on their own portfolio and has a different risk tolerance. Using choices made by past clients about which financial products they prefer, the financial advisor may form a probability distribution over utility functions to use when running a computationally expensive simulation to develop a menu of options to show a new client.

We use CVXPortfolio (cvxportfolio) to simulate and optimize the evolution of a portfolio over a period of four years, from Jan. 2012 through Dec. 2015, using open-source market data; the details of the simulation can be found in §7.1 of cvxportfolio. Here, $f$ has two outputs, the mean and standard deviation of the daily returns. We also use a non-standard utility function, equal to the mean return if the risk (standard deviation) does not exceed $\theta$ and $-\infty$ otherwise. This recovers the constrained optimization problem that maximizes the mean return subject to the constraint that the risk is at most $\theta$. Analogously to the case of linear utility functions discussed in Proposition 2, it can be shown that for this class of utility functions, u-EI admits an expression similar to that of the constrained expected improvement (gardner14).

Thus, in this setting we wish to maximize average return subject to an unknown constraint on the decision-maker’s risk tolerance level $\theta$, over which we place a uniform prior. The hyper-parameters to be tuned are the trade, hold, and risk-aversion parameters of the trading strategy. Results are shown in Figure 4. Here, the optimal solution is unknown, so we report the utility value instead. As before, u-EI substantially outperforms Random and ParEGO.

### 5.4 Optimization of Ambulance Bases

In this test problem, we optimize the locations of three ambulance bases according to the distribution of response times. We consider 4 attributes, representing the fractions of response times falling within four time intervals, and assume a decision-maker considers these attributes to choose the ideal locations of the ambulance bases. Because these attributes necessarily lie between 0 and 1, we model their logits rather than the attributes directly. We then use the utility function

$$U(y;\theta) = \sum_{j=1}^{4} \theta_j\, \frac{e^{y_j}}{1 + e^{y_j}} + \theta_5 \left(1 - \sum_{j=1}^{4} \frac{e^{y_j}}{1 + e^{y_j}}\right),$$

which corresponds to a linear utility function over the original attributes. Here, the prior on $\theta$ is taken to be uniform. Results are shown in Figure 5.
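The displayed utility can be evaluated directly from the logits; a minimal sketch (function and argument names are ours):

```python
import numpy as np

def ambulance_utility(y, theta):
    """Utility from the displayed formula: linear in the original (sigmoid) attributes.

    y     : (4,) logits of the four response-time fractions
    theta : (5,) weights; theta[4] weights the residual fraction
    """
    p = 1.0 / (1.0 + np.exp(-np.asarray(y)))   # back-transform logits: e^y / (1 + e^y)
    return float(np.dot(theta[:4], p) + theta[4] * (1.0 - p.sum()))
```

For example, at zero logits every transformed attribute equals 0.5, so with unit weights the utility is $4 \cdot 0.5 + 1 \cdot (1 - 2) = 1$.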

## 6 Conclusion

We introduced a novel framework for supporting decision-making processes based on expensive physical or computational experiments when there is uncertainty about the decision-maker’s preferences. Our approach aims to be more robust to this uncertainty, and our proposed algorithm is able to leverage prior information on the decision-maker’s preferences to improve sampling efficiency.

## Appendix A Unbiased Estimator of the Gradient of u-EI

In this section we formally state and prove Proposition 1.

###### Proposition 1.

Suppose that U(·;θ) is differentiable for all θ, and let X₀ be an open subset of X such that μₙ and Cₙ are differentiable on X₀ and there exists a measurable function η satisfying

1. ‖γ(x,θ,Z)‖ ≤ η(θ,Z) for all x ∈ X₀, θ, and Z.

2. E[η(θ,Z)] < ∞, where Z is an m-variate standard normal random vector independent of θ, and the expectation is over both θ and Z.

Further, suppose that for almost every θ and Z the set {x ∈ X₀ : U(μₙ(x)+Cₙ(x)Z;θ) = U∗ₙ(f;θ)} is countable. Then, u-EI is differentiable and its gradient, when it exists, is given by

 \nabla\,\text{u-EI}(x) = E[\gamma(x,\theta,Z)],

where the expectation is over θ and Z, and

 \gamma(x,\theta,Z) = \begin{cases} \nabla U(\mu_n(x) + C_n(x)Z;\theta), & \text{if } U(\mu_n(x) + C_n(x)Z;\theta) > U_n^*(f;\theta),\\ 0, & \text{otherwise.} \end{cases}
###### Proof.

From the given hypotheses it follows that, for any fixed θ and Z, the function x ↦ U(μₙ(x)+Cₙ(x)Z;θ) is differentiable on X₀. This in turn implies that the function x ↦ {U(μₙ(x)+Cₙ(x)Z;θ) − U∗ₙ(f;θ)}⁺ is continuous on X₀ and differentiable at every x such that U(μₙ(x)+Cₙ(x)Z;θ) ≠ U∗ₙ(f;θ), with gradient equal to γ(x,θ,Z). From our assumption that for almost every θ and Z the set {x ∈ X₀ : U(μₙ(x)+Cₙ(x)Z;θ) = U∗ₙ(f;θ)} is countable, it follows that for almost every θ and Z this function is continuous on X₀ and differentiable on all of X₀, except maybe on a countable subset. Using this, along with conditions 1 and 2, and Theorem 1 in l1990unified, the desired result follows. ∎

We note that, if one imposes the stronger condition E[η(θ,Z)²] < ∞, then γ(x,θ,Z) has finite second moment, and thus this unbiased estimator of ∇u-EI(x) can be used within stochastic gradient ascent to find a stationary point of u-EI (bottou1998online).
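The pathwise estimator of Proposition 1 can be sketched as follows. Everything here is a toy stand-in: the caller supplies the posterior mean, a Cholesky factor, and their derivatives in a scalar input x (which a real implementation would obtain from the GP model), and the function names are ours.

```python
import numpy as np

def grad_u_ei_mc(mu, dmu, chol, dchol, theta_samples, u_best, utility, grad_u,
                 n_z=20_000, seed=0):
    """Unbiased pathwise estimate of d/dx u-EI at a scalar x (Proposition 1):
    average the chain-rule derivative of U(mu_n(x) + C_n(x) Z; theta) over the
    samples where the improvement is positive, then over theta samples.

    mu, chol: posterior mean (m,) and Cholesky factor (m, m) at x;
    dmu, dchol: their derivatives in x; utility(y, theta) -> (n,);
    grad_u(y, theta) -> (n, m), the gradient of U in its attribute argument."""
    rng = np.random.default_rng(seed)
    m = len(mu)
    grads = []
    for theta, ub in zip(theta_samples, u_best):
        z = rng.standard_normal((n_z, m))
        y = mu + z @ chol.T                    # samples of f(x) | data
        dy = dmu + z @ dchol.T                 # d/dx of those samples
        improve = utility(y, theta) > ub       # indicator {U > U*_n}
        # chain rule: d/dx U(y(x); theta) = <grad_y U, dy/dx>; zero otherwise
        g = np.einsum('ij,ij->i', grad_u(y, theta), dy)
        grads.append(np.mean(np.where(improve, g, 0.0)))
    return float(np.mean(grads))
```

As a sanity check, for a single output with linear utility, mean 0, unit variance, constant posterior standard deviation, and incumbent 0, the gradient reduces to Φ(0) = 1/2 times the mean's derivative.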

## Appendix B Computation of u-EI and Its Gradient When U Is Linear

In this section we formally state and prove Propositions 2 and 3.

###### Proposition 2.

Suppose that U(y;θ) = θ⊤y for all y and θ. Then,

 \text{u-EI}(x) = E_n\!\left[\Delta_n(x;\theta)\,\Phi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right) + \sigma_n(x;\theta)\,\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)\right],

where the expectation is over θ, Δₙ(x;θ) = θ⊤μₙ(x) − U∗ₙ(f;θ), σₙ(x;θ) = (θ⊤Kₙ(x)θ)^{1/2}, and φ and Φ are the standard normal probability density function and cumulative distribution function, respectively.

###### Proof.

Note that

 \text{u-EI}(x) = E_n\!\left[E_n\!\left[\{\theta^\top f(x) - U_n^*(f;\theta)\}^+ \mid \theta\right]\right].

Thus, it suffices to show that

 E_n\!\left[\{\theta^\top f(x) - U_n^*(f;\theta)\}^+ \mid \theta\right] = \Delta_n(x;\theta)\,\Phi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right) + \sigma_n(x;\theta)\,\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right),

but this can be easily verified by noting that, conditioned on θ, the time-n posterior distribution of θ⊤f(x) is normal with mean θ⊤μₙ(x) and variance θ⊤Kₙ(x)θ. ∎
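For each fixed θ, the expression in Proposition 2 is the classical single-objective expected-improvement formula; the following numpy/scipy sketch (illustrative names, not the paper's code) averages it over samples of θ:

```python
import numpy as np
from scipy.stats import norm

def u_ei_linear(mu, cov, theta_samples, u_best):
    """u-EI under linear utility U(y; theta) = theta^T y (Proposition 2):
    for each theta sample, the classical EI formula with
    Delta = theta^T mu - U*_n and sigma^2 = theta^T K theta,
    averaged over the theta samples.

    mu: posterior mean (m,); cov: posterior covariance (m, m);
    u_best: U*_n(f; theta), one entry per theta sample."""
    vals = []
    for theta, ub in zip(theta_samples, u_best):
        delta = theta @ mu - ub
        sigma = np.sqrt(theta @ cov @ theta)
        z = delta / sigma
        vals.append(delta * norm.cdf(z) + sigma * norm.pdf(z))
    return float(np.mean(vals))
```

With a single θ sample this is exactly standard expected improvement on the scalar objective θ⊤f.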

###### Proposition 3.

Suppose that U(y;θ) = θ⊤y for all y and θ, that μₙ and Kₙ are differentiable, and that there exists a function η satisfying

1. ‖∇Eₙ[{θ⊤f(x) − U∗ₙ(f;θ)}⁺ ∣ θ]‖ ≤ η(θ) for all x and θ.

2. Eₙ[η(θ)] < ∞.

Then, u-EI is differentiable and its gradient is given by

 \nabla\,\text{u-EI}(x) = E_n\!\left[(\theta^\top \nabla\mu_n(x))\,\Phi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right) + \frac{\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)}{2\,\sigma_n(x;\theta)}\sum_{i,j=1}^{m}\theta_i\theta_j\,\nabla K_n(x)_{i,j}\right].
###### Proof.

Recall that

 E_n\!\left[\{\theta^\top f(x) - U_n^*(f;\theta)\}^+ \mid \theta\right] = \Delta_n(x;\theta)\,\Phi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right) + \sigma_n(x;\theta)\,\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right).

Moreover, standard calculations show that

 \nabla\!\left[\Delta_n(x;\theta)\,\Phi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)\right] = (\theta^\top \nabla\mu_n(x))\,\Phi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right) + \Delta_n(x;\theta)\,\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)\nabla\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)},

and

 \begin{align*}
 \nabla\!\left[\sigma_n(x;\theta)\,\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)\right]
 &= \frac{\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)}{2\,\sigma_n(x;\theta)}\sum_{i,j=1}^{m}\theta_i\theta_j\,\nabla K_n(x)_{i,j} + \sigma_n(x;\theta)\left[-\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\,\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)\nabla\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right]\\
 &= \frac{\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)}{2\,\sigma_n(x;\theta)}\sum_{i,j=1}^{m}\theta_i\theta_j\,\nabla K_n(x)_{i,j} - \Delta_n(x;\theta)\,\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)\nabla\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}.
 \end{align*}

Thus, Eₙ[{θ⊤f(x) − U∗ₙ(f;θ)}⁺ ∣ θ] is a differentiable function of x, and its gradient is given by

 \nabla E_n\!\left[\{\theta^\top f(x) - U_n^*(f;\theta)\}^+ \mid \theta\right] = (\theta^\top \nabla\mu_n(x))\,\Phi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right) + \frac{\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)}{2\,\sigma_n(x;\theta)}\sum_{i,j=1}^{m}\theta_i\theta_j\,\nabla K_n(x)_{i,j}.

From conditions 1 and 2, and Theorem 16.8 in billingsley1995probability, it follows that u-EI is differentiable and its gradient is given by

 \nabla\,\text{u-EI}(x) = E_n\!\left[\nabla E_n\!\left[\{\theta^\top f(x) - U_n^*(f;\theta)\}^+ \mid \theta\right]\right],

i.e.,

 \nabla\,\text{u-EI}(x) = E_n\!\left[(\theta^\top \nabla\mu_n(x))\,\Phi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right) + \frac{\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)}{2\,\sigma_n(x;\theta)}\sum_{i,j=1}^{m}\theta_i\theta_j\,\nabla K_n(x)_{i,j}\right]. \qquad\blacksquare

We end by noting that if X is compact and μₙ and Kₙ are both continuously differentiable, then

 (\theta,x) \mapsto \left\|(\theta^\top \nabla\mu_n(x))\,\Phi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right) + \frac{\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)}{2\,\sigma_n(x;\theta)}\sum_{i,j=1}^{m}\theta_i\theta_j\,\nabla K_n(x)_{i,j}\right\|

is continuous and thus attains its maximum value on Θ × X (recall that Θ is compact as well). Thus, in this case conditions 1 and 2 are satisfied by the constant function

 \eta \equiv \max_{(\theta,x)\in\Theta\times X}\left\|(\theta^\top \nabla\mu_n(x))\,\Phi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right) + \frac{\varphi\!\left(\frac{\Delta_n(x;\theta)}{\sigma_n(x;\theta)}\right)}{2\,\sigma_n(x;\theta)}\sum_{i,j=1}^{m}\theta_i\theta_j\,\nabla K_n(x)_{i,j}\right\|.

## Appendix C Exploration and Exploitation Trade-Off

###### Proposition 4.

Suppose that, for every θ, U(·;θ) is convex and non-decreasing. Also suppose x, x′ ∈ X are such that μₙ(x) ≥ μₙ(x′) and Kₙ(x) ⪰ Kₙ(x′), where the first inequality is coordinate-wise and ⪰ denotes the partial order defined by the cone of positive semi-definite matrices. Then,

 \text{u-EI}_n(x) \ge \text{u-EI}_n(x').
###### Proof.

Since Kₙ(x) ⪰ Kₙ(x′), under the time-n posterior we have that f(x) is equal in distribution to f(x′) + (μₙ(x) − μₙ(x′)) + W, where W is an m-variate normal random vector with zero mean and covariance matrix Kₙ(x) − Kₙ(x′), independent of f(x′). Thus,

 \begin{align*}
 E_n\!\left[\{U(f(x);\theta) - U_n^*(f;\theta)\}^+ \mid \theta\right]
 &= E_n\!\left[\{U(f(x') + (\mu_n(x) - \mu_n(x')) + W;\theta) - U_n^*(f;\theta)\}^+ \mid \theta\right]\\
 &\ge E_n\!\left[\{U(f(x') + W;\theta) - U_n^*(f;\theta)\}^+ \mid \theta\right]\\
 &= E_n\!\left[E_n\!\left[\{U(f(x') + W;\theta) - U_n^*(f;\theta)\}^+ \mid \theta, f(x')\right]\right]\\
 &\ge E_n\!\left[\{U(f(x');\theta) - U_n^*(f;\theta)\}^+ \mid \theta\right],
 \end{align*}

where the first and second inequalities follow from the facts that the function y ↦ {U(y;θ) − U∗ₙ(f;θ)}⁺ is non-decreasing and convex, respectively, the latter combined with Jensen’s inequality. Finally, taking expectations with respect to θ yields the desired result. ∎

## References
