
## Abstract

We consider black-box global optimization of time-consuming-to-evaluate functions on behalf of a decision-maker (DM) whose preferences must be learned. Each feasible design is associated with a time-consuming-to-evaluate vector of attributes and each vector of attributes is assigned a utility by the DM’s utility function, which may be learned approximately using preferences expressed over pairs of attribute vectors. Past work has used a point estimate of this utility function as if it were error-free within single-objective optimization. However, utility estimation errors may yield a poor suggested design. Furthermore, this approach produces a single suggested “best” design, whereas DMs often prefer to choose from a menu. We propose a novel multi-attribute Bayesian optimization with preference learning approach. Our approach acknowledges the uncertainty in preference estimation and implicitly chooses designs to evaluate that are good not just for a single estimated utility function but a range of likely ones. The outcome of our approach is a menu of designs and evaluated attributes from which the DM makes a final selection. We demonstrate the value and flexibility of our approach in a variety of experiments.

*Running title:* Multi-attribute Bayesian optimization with interactive preference learning

## 1 Introduction

We begin with a motivating example: helping a cancer patient (the “decision-maker” or “DM”) find the best treatment. Cancer treatments exhibit a range of efficacies, side effects and financial costs (aning2012patient; wong2013cancer; marshall2016women), referred to here as “attributes”. Suppose a patient considers $k$ real-valued attributes when selecting a cancer treatment. Also suppose a time-consuming-to-evaluate black-box computational simulator can use the patient’s medical history to compute the attributes, $f(x)$, of treatment $x$. The patient has an implicit preference over these attributes and our goal is to help her find her most preferred treatment by querying our simulator.

One existing approach, pursued within preference-based reinforcement learning (wirth2017survey), is to first learn a point estimate of the patient’s preferences (dewancker2016; abbas2018foundations) and then optimize assuming this point estimate is correct. We call this the “point-estimate approach”. This approach asks the patient for her preference between attribute vectors $y$ and $y'$ corresponding to pairs of treatments $x$ and $x'$, and uses this information to learn a utility function $U$, e.g., using preference learning with Gaussian processes (chu2005preference), such that the judgments are as consistent as possible with the estimated utility differences $U(y) - U(y')$. It then solves $\max_x U(f(x))$ using a method for optimizing time-consuming-to-evaluate black-box functions, such as Bayesian optimization (BayesOpt) (frazier2018tutorial), assuming that the estimated utility function is correct. This approach, however, is not robust to residual uncertainty in preference estimates.

To illustrate how robustness to uncertainty in preferences can improve performance, suppose that preference learning suggests that the patient’s true utility function is close to one of $J$ possible functions $U_1, \dots, U_J$. Then, a better approach would be to offer the patient a set of treatments $x^*_1, \dots, x^*_J$, where $x^*_j \in \arg\max_x U_j(f(x))$, and let her choose among them. This will provide near-optimal utility to the patient, while optimizing for a single point estimate of the utility function will not. While this approach improves over the standard approach in the utility it provides, it requires solving $J$ optimization problems with a time-consuming-to-evaluate objective, which becomes computationally infeasible as $J$ grows. Our approach (described below) delivers similar utility gains using fewer queries to the objective function.

Another approach, which can be used when each attribute is a quantity that the patient wants to be as large (or small) as possible, is to use multi-objective Bayesian optimization (knowles2006; abdolshah2019multi) to estimate the Pareto frontier. This approach, however, typically does not use interaction with the patient to focus optimization on the parts of the Pareto frontier most likely to contain the patient’s preferred solution. Intuitively, such information could accelerate optimization, especially when moderate or large numbers of attributes create high-dimensional Pareto frontiers and lead to many Pareto-optimal solutions.

Motivated by the shortcomings of existing approaches, we propose optimization with preference learning, which learns preferences from the DM’s feedback and acknowledges uncertainty in these learned preferences. In contrast with the point-estimate approach, our approach is significantly more robust to residual preference uncertainty because its optimization actions are appropriate for a range of plausible utility functions. In contrast with multi-objective optimization approaches, learned preferences allow our approach to use fewer objective function queries by focusing optimization on portions of the attribute space most likely to be preferred by the DM. Our approach, therefore, fills an important gap between today’s single-objective optimization approaches, which assume perfect knowledge of preferences, and multi-objective optimization approaches, which do not provide a principled way to accommodate partial preference information.

We develop optimization with preference learning within the specific context of Bayesian optimization. We use pairwise judgments from the DM to form a Bayesian posterior distribution over her utility function and model the attributes with a multi-output Gaussian process. We then use one of two novel acquisition functions, the expected improvement under utility uncertainty (EI-UU) or Thompson sampling under utility uncertainty (TS-UU), to iteratively choose designs at which to evaluate $f$. Optionally, during optimization, additional judgments from the DM on the evaluated designs may be incorporated into our posterior distribution on the utility. At the conclusion of optimization, a menu of designs and evaluated attributes is shown to the DM, who makes a final selection.

Our proposed acquisition functions, EI-UU and TS-UU, generalize existing Bayesian optimization acquisition functions to the optimization with preference learning setting. EI-UU is more challenging to maximize than its classical counterpart. However, we provide a simulation-based method for computing an unbiased estimator of its gradient, which we use within a multi-start stochastic gradient optimization method.

The remainder of this paper is organized as follows. We first formalize our problem setting in §2, before defining the EI-UU acquisition function in §3, and reviewing other related work in §4. §5 presents numerical experiments, and §6 concludes.

## 2 Problem Setting

We now formally describe our problem setting.

### 2.1 Designs and Attributes

We assume that both designs and attributes can be represented as vectors. More concretely, we assume that the space of designs can be represented as a compact set $\mathbb{X} \subset \mathbb{R}^d$, and attributes are given by a derivative-free time-consuming-to-evaluate black-box continuous function, $f : \mathbb{X} \to \mathbb{R}^k$. As is common in BayesOpt, we assume that $\mathbb{X}$ is a simple set such as a hyperrectangle or a polytope, and that $d$ is not too large.

### 2.2 Decision-Maker’s Preferences

We assume that there is a DM whose preference over designs is characterized by the designs’ attributes through a von Neumann-Morgenstern utility function (vonNeuman), $U : \mathbb{R}^k \to \mathbb{R}$. This implies that the DM (strictly) prefers a design $x$ over $x'$ if and only if $U(f(x)) > U(f(x'))$. Thus, of all the designs, the DM most prefers one in the set $\arg\max_{x \in \mathbb{X}} U(f(x))$. As is standard in preference learning (furnkranz2010preference), we assume that the DM can provide ordinal preferences between two designs $x$ and $x'$ when shown previously evaluated attribute vectors $f(x)$ and $f(x')$.

### 2.3 Interaction With the Decision-Maker and Computational Model

In our approach, an algorithm interacts sequentially with a human DM and a time-consuming-to-evaluate objective function (typically a computer model). The algorithm interacts with the computational model simply by selecting a design and evaluating $f$. We let $x_n$ denote the $n$th point at which we evaluate $f$. As is standard in BayesOpt, the first set of evaluations of $f$ is chosen uniformly at random or according to a space-filling design over the feasible domain (joseph2016space), and subsequent evaluations are guided by an acquisition function described below in §3.

The algorithm interacts with the DM by receiving ordinal preferences between pairs of attribute vectors. We index interactions with the DM by $m$, letting $y_m$ and $y'_m$ refer to the attribute vectors queried in this interaction, and $a_m$ indicate the DM’s response, where $a_m = 1$ indicates a preference for $y_m$, $a_m = 0$ indicates indifference, and $a_m = -1$ indicates a preference for $y'_m$. We let $M(n)$ be the number of attribute-vector pairs evaluated by the DM by the completion of the $n$th run of the computational model. We envision that the $y_m$ and $y'_m$ would typically be the attribute vectors for previously evaluated designs, $y_m = f(x_i)$ and $y'_m = f(x_j)$, where $i, j \leq n$.

For concreteness, our numerical experiments assume that, before each evaluation of , the DM provides feedback on one pair of designs chosen uniformly at random from among those previously evaluated. Our framework easily supports other patterns of interaction. For example, it supports a setting where the DM provides feedback in a single batch after the first-stage evaluations of the computational model are complete, either over random previously evaluated attribute vectors or using a more sophisticated and query-efficient selection of attribute vectors (see, e.g., lepird2015bayesian). It also supports a setting in which the DM provides feedback at a random series of time points on pairs of previously evaluated attribute vectors of her choosing.
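The random-pair interaction pattern described above can be sketched as follows. This is a minimal illustration, not the paper's released code; every function and variable name here is ours.

```python
import random

def query_random_pair(evaluated_attrs, dm_respond):
    """Ask the DM to compare a uniformly random pair of previously
    evaluated attribute vectors.

    Returns (y, y_prime, a), where a = 1 means y is preferred,
    a = -1 means y_prime is preferred, and a = 0 means indifference.
    """
    y, y_prime = random.sample(evaluated_attrs, 2)  # distinct pair
    a = dm_respond(y, y_prime)
    return y, y_prime, a

# Stand-in DM whose (hidden) utility is the sum of the attributes:
dm = lambda y, yp: (sum(y) > sum(yp)) - (sum(y) < sum(yp))
attrs = [(0.2, 0.9), (0.8, 0.1), (0.5, 0.5)]
y, yp, a = query_random_pair(attrs, dm)
```

In a batch setting, the same helper would simply be called several times after the first-stage evaluations complete.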

### 2.4 Statistical Model Over f

As is standard in BayesOpt, we place a (multi-output) Gaussian process (GP) prior on $f$ (alvarez2012kernels), $f \sim \mathcal{GP}(\mu, K)$, characterized by a mean function, $\mu$, and a positive definite covariance function, $K$. Thus, after observing $n$ noise-free evaluations of $f$ at points $x_1, \dots, x_n$, the estimates of the designs’ attributes are given by the posterior distribution on $f$, which is again a multi-output GP, $\mathcal{GP}(\mu_n, K_n)$, where $\mu_n$ and $K_n$ can be computed in closed form in terms of $\mu$ and $K$ (liu2018remarks).

### 2.5 Statistical Model Over U

We use Bayesian preference learning (chu2005preference; lepird2015bayesian) to infer a posterior probability distribution over the utility function, $U$, given preferences expressed by the DM. Although this method is standard in the literature, we describe it here for completeness.

We use a parametric family of utility functions $U(\cdot\,;\theta)$, $\theta \in \Theta$ (following, for example, akrour2014programming; wirth2016model); a prior probability distribution over $\theta$, $p_\theta$; and a likelihood function, $L(a; \Delta U)$, giving the conditional probability of the DM expressing preference $a$ in response to an offered pair of attribute vectors $y, y'$ with utility difference $\Delta U = U(y;\theta) - U(y';\theta)$. The posterior distribution over $\theta$ after feedback on $m$ pairwise comparisons, written $p_{\theta,m}$, is then given by Bayes’ rule:

$$p_{\theta,m}(\theta) \propto p_\theta(\theta) \prod_{j=1}^{m} L\big(a_j;\, U(y_j;\theta) - U(y'_j;\theta)\big).$$

In our approach, we rely only on the ability to sample from this posterior distribution.
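Because only samples are required, a simple self-normalized importance-sampling scheme suffices when $\theta$ is low-dimensional. The sketch below assumes a two-dimensional $\theta$ with a uniform prior, a linear utility, and a logistic likelihood; all of these choices, and all names, are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_theta_posterior(comparisons, n_samples=2000, seed=0):
    """Draw approximate posterior samples of theta via importance sampling.

    comparisons: list of (a, y, y_prime) with a in {-1, 0, 1}.
    Assumptions of this sketch: prior theta ~ Uniform([0, 1]^2);
    utility U(y; theta) = theta @ y; logistic likelihood in a * Delta U.
    """
    rng = np.random.default_rng(seed)
    thetas = rng.uniform(0.0, 1.0, size=(n_samples, 2))  # prior draws
    weights = np.ones(n_samples)
    for a, y, yp in comparisons:
        du = thetas @ (np.asarray(y, float) - np.asarray(yp, float))
        weights *= 1.0 / (1.0 + np.exp(-a * du))  # logistic likelihood
    weights /= weights.sum()
    # Resample in proportion to the weights to get approximate posterior draws.
    return thetas[rng.choice(n_samples, size=n_samples, p=weights)]

# DM repeatedly prefers y = (1, 0) over y' = (0, 1), so the posterior
# should concentrate on theta_1 > theta_2:
posterior = sample_theta_posterior([(1, (1.0, 0.0), (0.0, 1.0))] * 20)
```

For higher-dimensional $\theta$, one would replace the importance sampler with MCMC, but the interface (returning posterior draws) stays the same.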

The most widely used parametric family of utility functions is linear functions, $U(y;\theta) = \theta^\top y$ (wirth2017survey), with other examples including linear functions over kernel-based feature spaces (wirth2016model; kupcsik2018learning) and deep neural networks (christiano2017deep). Commonly used likelihood functions include probit and logit (wirth2017survey). In our numerical experiments, for simplicity, we assume fully accurate preference responses, i.e., $L(a; \Delta U) = 1\{a = \operatorname{sign}(\Delta U)\}$, with parametric families and priors described below. Although we assume parametric utility functions, conceptually, our approach generalizes to handle nonparametric Bayesian preference learning (see, e.g., chu2005preference). However, this poses additional computational challenges, as our approach internally performs optimization given samples of the utility function, which can be slow for nonparametric models.

### 2.6 Measure of Performance

We suppose that, after $N$ evaluations of the computational model (and $M(N)$ judgments on attribute vector pairs), the DM selects her most preferred design among all evaluated designs. Thus, the utility generated, given $\theta$, is

$$\max_{i=1,\ldots,N} U(f(x_i);\theta), \tag{1}$$

and we wish to adaptively choose the designs to evaluate, $x_1, \dots, x_N$, to maximize the expected value of (1),

$$\mathbb{E}\left[\max_{i=1,\ldots,N} U(f(x_i);\theta)\right], \tag{2}$$

where the expectation is taken over the prior on $\theta$ and the randomness in $x_1, \dots, x_N$ (induced by the random first stage of samples and randomness in the DM’s responses).

The full BayesOpt with preference learning loop is summarized in Algorithm  1.
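A minimal sketch of this loop, reflecting our reading of Algorithm 1, is below. All helper callables (`fit_gp`, `sample_theta`, `maximize_acq`, `dm_respond`, etc.) are user-supplied stand-ins we name for illustration; none come from the paper's code.

```python
import numpy as np

def bopl_loop(f, sample_design, n_init, n_total,
              fit_gp, sample_theta, maximize_acq, dm_respond, seed=0):
    """Sketch of BayesOpt with preference learning.

    f: time-consuming attribute function; sample_design: draws a random
    feasible design; fit_gp / sample_theta / maximize_acq / dm_respond
    stand in for the statistical models, the acquisition optimizer
    (e.g., EI-UU), and the DM.
    """
    rng = np.random.default_rng(seed)
    X = [sample_design(rng) for _ in range(n_init)]  # first stage: random designs
    Y = [f(x) for x in X]
    comparisons = []
    for _ in range(n_init, n_total):
        i, j = rng.choice(len(Y), size=2, replace=False)  # random evaluated pair
        comparisons.append((dm_respond(Y[i], Y[j]), Y[i], Y[j]))
        gp = fit_gp(X, Y)
        thetas = sample_theta(comparisons)
        x_next = maximize_acq(gp, thetas)
        X.append(x_next)
        Y.append(f(x_next))
    return X, Y  # the DM picks her favorite among the evaluated designs

# Toy run with trivial stand-ins:
X, Y = bopl_loop(
    f=lambda x: (x, 1.0 - x),
    sample_design=lambda rng: float(rng.uniform()),
    n_init=3, n_total=6,
    fit_gp=lambda X, Y: None,
    sample_theta=lambda c: [np.array([1.0, 0.0])],
    maximize_acq=lambda gp, thetas: 0.5,
    dm_respond=lambda y, yp: 1,
)
```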

## 3 Acquisition Functions

We propose two novel acquisition functions, the Expected Improvement under Utility Uncertainty (EI-UU) and Thompson Sampling under Utility Uncertainty (TS-UU), for selecting points at which to query $f$. The bulk of our development and analysis focuses on EI-UU, since this is the more difficult of the two to optimize, and this acquisition function performs best in numerical experiments. The description of TS-UU is deferred to Appendix C.

### 3.1 Expected Improvement Under Utility Uncertainty (EI-UU)

Expected improvement is arguably the most popular acquisition function in BayesOpt. It has been successfully generalized for multi-objective and constrained optimization (emmerich2006single; gardner14), and we next show that it can be naturally generalized to our setting as well by extending expected improvement’s one-step optimality analysis (jones1998efficient; frazier2018tutorial).

After evaluating $n$ designs, $x_1, \dots, x_n$, the utility obtained by the DM when she selects her most preferred design among this set is

$$U^*_n(f;\theta) := \max_{i=1,\ldots,n} U(f(x_i);\theta).$$

On the other hand, if we evaluate one more design, $x$, the utility obtained by the DM increases by

$$\max\{U(f(x);\theta),\, U^*_n(f;\theta)\} - U^*_n(f;\theta) = \{U(f(x);\theta) - U^*_n(f;\theta)\}^+.$$

This difference measures the improvement from sampling $x$. Thus, a natural sampling policy is to evaluate the design that maximizes the expected improvement

$$\mathrm{EI\text{-}UU}_n(x) := \mathbb{E}_n\left[\{U(f(x);\theta) - U^*_n(f;\theta)\}^+\right], \tag{3}$$

where the expectation is over both $f$ and $\theta$, and the subscript $n$ in $\mathbb{E}_n$ indicates that the expectation is computed with respect to their corresponding posterior distributions given the previous computational evaluations, $f(x_1), \dots, f(x_n)$, and the DM’s responses, $a_1, \dots, a_{M(n)}$.

We call EI-UU the expected improvement under utility uncertainty and refer to the above policy as the EI-UU policy. By construction, this sampling policy is one-step Bayes optimal.

### 3.2 Computation and Maximization of EI-UU

In contrast with the standard expected improvement, EI-UU cannot be computed in closed form. However, as we show next, it can still be efficiently maximized. First, we introduce some notation. Making a slight abuse of notation, we denote the posterior covariance matrix $K_n(x,x)$ by $K_n(x)$. We also let $C_n(x)$ be the lower Cholesky factor of $K_n(x)$.

We note that, for any fixed $\theta$, the time-$n$ posterior distribution of $f(x)$ is normal with mean $\mu_n(x)$ and covariance matrix $K_n(x)$. Therefore, we can express $f(x) = \mu_n(x) + C_n(x)Z$, where $Z$ is a $k$-variate standard normal random vector, and thus

$$\mathrm{EI\text{-}UU}_n(x) = \mathbb{E}_n\left[\{U(\mu_n(x) + C_n(x)Z;\theta) - U^*_n(f;\theta)\}^+\right].$$

This implies that we can compute $\mathrm{EI\text{-}UU}_n(x)$ using Monte Carlo simulation, as summarized in Algorithm 2.
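This Monte Carlo computation can be sketched as follows, in our own notation: `mu_x` and `C_x` stand for $\mu_n(x)$ and $C_n(x)$, `ustar` for $U^*_n(f;\cdot)$, and `theta_samples` for posterior draws of $\theta$.

```python
import numpy as np

def ei_uu_mc(mu_x, C_x, utility, ustar, theta_samples, n_z=1000, seed=0):
    """Monte Carlo estimate of EI-UU_n(x).

    For each posterior sample of theta, draw standard normal vectors Z,
    form f(x) = mu_n(x) + C_n(x) Z, and average the positive part of the
    utility improvement over the incumbent U*_n(f; theta).
    """
    rng = np.random.default_rng(seed)
    mu_x = np.asarray(mu_x, float)
    total = 0.0
    for theta in theta_samples:
        Z = rng.standard_normal((n_z, mu_x.size))
        y = mu_x + Z @ np.asarray(C_x, float).T  # draws of f(x) given the data
        total += np.maximum(utility(y, theta) - ustar(theta), 0.0).mean()
    return total / len(theta_samples)

# Degenerate check: with zero posterior covariance, the estimate reduces
# to the average positive utility gap at the posterior mean.
val = ei_uu_mc(
    mu_x=[1.0, 1.0],
    C_x=np.zeros((2, 2)),
    utility=lambda y, th: y @ th,
    ustar=lambda th: 0.5,
    theta_samples=[np.array([1.0, 0.0]), np.array([0.0, 1.0])],
)
```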

In principle, the above is enough to maximize EI-UU using a derivative-free global optimization algorithm (for non-expensive functions). However, EI-UU can be optimized more efficiently by leveraging the derivative information provided by the following proposition.

###### Proposition 1.

Under mild regularity conditions, $\mathrm{EI\text{-}UU}_n$ is differentiable almost everywhere, and its gradient, when it exists, is given by

$$\nabla\,\mathrm{EI\text{-}UU}_n(x) = \mathbb{E}_n\left[\gamma_n(x, Z; \theta)\right],$$

where the expectation is over $Z$ and $\theta$, and

$$\gamma_n(x,Z;\theta) = \begin{cases} 0, & \text{if } U(\mu_n(x)+C_n(x)Z;\theta) \leq U^*_n(f;\theta),\\ \nabla U(\mu_n(x)+C_n(x)Z;\theta), & \text{otherwise,} \end{cases}$$

where the gradient is with respect to $x$.

Thus, $\gamma_n(x,Z;\theta)$ provides an unbiased estimator of $\nabla\,\mathrm{EI\text{-}UU}_n(x)$, which can be used within a gradient-based stochastic optimization algorithm, such as stochastic gradient ascent, to find stationary points of EI-UU. We may then start stochastic gradient ascent from multiple starting points, use simulation to evaluate EI-UU at each resulting solution, and select the best. By increasing the number of starting points, we may find a high-quality local optimum and, asymptotically, a global optimum.

A formal statement and proof of Proposition  1 can be found in Appendix  A.

### 3.3 Computation of EI-UU When U Is Linear

While the above approach can be used for efficiently maximizing EI-UU for general utility functions, we can make maximization even more efficient for linear utility functions, the most widely used class in practice.

###### Proposition 2.

Suppose that $\Theta \subset \mathbb{R}^k$ and $U(y;\theta) = \theta^\top y$ for all $y \in \mathbb{R}^k$ and $\theta \in \Theta$. Then,

$$\mathrm{EI\text{-}UU}_n(x) = \mathbb{E}_n\left[\Delta_n(x;\theta)\,\Phi(\zeta) + \sigma_n(x;\theta)\,\varphi(\zeta)\right],$$

where the expectation is over $\theta$, $\Delta_n(x;\theta) = \theta^\top \mu_n(x) - U^*_n(f;\theta)$, $\sigma_n(x;\theta) = \sqrt{\theta^\top K_n(x)\theta}$, $\zeta = \Delta_n(x;\theta)/\sigma_n(x;\theta)$, and $\varphi$ and $\Phi$ are the standard normal density function and cumulative distribution function, respectively.

The result above shows that, when each $U(\cdot\,;\theta)$ is linear, the computation of EI-UU essentially reduces to that of the standard expected improvement, modulo integrating over the uncertainty in $\theta$. In particular, the uncertainty with respect to $Z$ can be integrated out. Moreover, in this case one can also derive a result analogous to Proposition 1 for computing the gradient of EI-UU in which the explicit dependence on $Z$ is eliminated as well. Formal statements and proofs of these two results can be found in Appendix B.
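Concretely, the linear-utility case can be computed by averaging EI-style closed forms over posterior samples of $\theta$. The sketch below uses our own variable names and a zero-variance fallback we add for numerical robustness.

```python
import math
import numpy as np

def ei_uu_linear(mu_x, K_x, ustar, theta_samples):
    """EI-UU(x) for U(y; theta) = theta^T y, via the closed form
    Delta * Phi(zeta) + sigma * phi(zeta), averaged over theta samples."""
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    vals = []
    for theta in theta_samples:
        delta = float(theta @ mu_x) - ustar(theta)     # Delta_n(x; theta)
        sigma = math.sqrt(float(theta @ K_x @ theta))  # sigma_n(x; theta)
        if sigma == 0.0:
            vals.append(max(delta, 0.0))               # degenerate case
        else:
            zeta = delta / sigma
            vals.append(delta * Phi(zeta) + sigma * phi(zeta))
    return float(np.mean(vals))

# Example: zero mean gap and unit variance gives sigma * phi(0) = 1/sqrt(2*pi).
val = ei_uu_linear(
    mu_x=np.zeros(2),
    K_x=np.eye(2),
    ustar=lambda th: 0.0,
    theta_samples=[np.array([1.0, 0.0])],
)
```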

## 4 Related Work

The introduction discusses the two lines of most closely related work: the point-estimate approach pursued within preference-based reinforcement learning (PbRL), and multi-objective BayesOpt.

The most closely related work in PbRL is utility-based PbRL using trajectory utilities (wirth2017survey). This variant of PbRL seeks to design a control policy to maximize the utility of a human subject using features computed from trajectories. Work in this area includes akrour2014programming and wirth2016model. Unlike our work, however, this line of work does not account for uncertainty in utility function estimates when performing optimization.

Multi-objective BayesOpt includes knowles2006; bautista2009; binois2015quantifying; shah2016pareto; feliot2017bayesian; hl16. Multi-objective optimization cannot easily incorporate prior information about the DM’s preferences, though several attempts have been made, mostly through modified Pareto-dominance criteria or weighted-sum approaches (cvetkovic2002preferences; zitzler2004indicator; rachmawati2006preference). Most of this work is outside the BayesOpt framework, with only three exceptions, which we describe below, known to us.

feliot2018user proposes a weighted version of the expected Pareto hypervolume improvement approach (emmerich2006single) to focus the search on certain regions of the Pareto front. However, no method is provided for choosing weights from data, in contrast with our approach’s ability to learn from interactions with the DM. Moreover, this method suffers from the same computational limitations as the standard expected Pareto hypervolume improvement approach, limiting its applicability to at most three objectives (hl16). abdolshah2019multi also proposes a weighted version of the expected Pareto hypervolume improvement approach to explore the region of the Pareto frontier satisfying a preference-order constraint over the objectives. Finally, paria2018flexible proposes an approach based on random scalarizations. In contrast with our approach, no method is available for estimating the distribution of these scalarizations from data.

Another related literature is preferential BayesOpt. Preferential BayesOpt (gonzalez2017preferential) has been applied to realistic material design in computer graphics (brochu2010tutorial) and to optimization of a parameterized control policy for robotic object handover (kupcsik2018learning). To apply preferential BayesOpt in our setting, we would choose pairs of treatments $x$ and $x'$, evaluate our computational model to obtain $f(x)$ and $f(x')$, and obtain feedback from the DM on which treatment is preferred. Pairs of treatments would then be chosen to best support the goal of finding the DM’s preferred design. Critically, these methods do not use the attributes, $f(x)$, except to present them to the DM, but instead learn preferences directly as a function of $x$. Thus, these methods tend to require many queries of the DM (wirth2017efficient; pinsler2018sample). Our approach leverages attribute observations to be more query-efficient.

Our work is also related to a line of research on adaptive utility elicitation (chajewska1998utility; chajewska2000making; boutilier2002pomdp; boutilier2006constraint). Unlike in classical utility elicitation (farquhar1984state; abbas2018foundations), which has accurate estimation as its final goal, this work elicits the DM’s utility function with the final goal of finding a good decision, even if this leaves residual uncertainty about the utility function (braziunas2006computational). However, this work assumes that attributes are inexpensive to evaluate, and that the space of designs is finite, preventing its use in our setting.

Our work builds on BayesOpt (frazier2018tutorial), a framework for optimization of time-consuming-to-evaluate black-box functions. Our proposed EI-UU acquisition function is a natural generalization of the classical expected improvement acquisition function in standard BayesOpt (movckus1975bayesian; jones1998efficient). EI-UU also generalizes the expected improvement for composite functions (astudillo2019bayesian), which can be obtained as a special case when $U$ is known.

Our work is also related to frazier2011guessing, which pursued a similar approach for the pure exploration multi-attribute multi-armed bandit problem with linear utility functions and without iterative interaction with the DM. Finally, an earlier version of this work, which considered linear utility functions only and no iterative interaction with the DM, appeared at astudillo2017multi.

## 5 Experiments

We compare the performance of our sampling policies (EI-UU and TS-UU) against the policy that chooses the points to sample uniformly at random (Random), and ParEGO (knowles2006), a popular multi-objective BayesOpt algorithm. To understand the benefit obtained from preference information within our proposed sampling policies, we also report their performance without preference learning, i.e., where the distribution of $\theta$ remains equal to its prior distribution throughout all evaluations of $f$. In the plots, we distinguish these from our sampling policies with preference learning by appending the subscript “npl” (which stands for “no preference learning”).

In all problems, an initial stage of evaluations is performed using points chosen uniformly at random over $\mathbb{X}$. A second stage (pictured in plots) is then performed using the given sampling policy. For our algorithms, the outputs of $f$ are modeled using independent GP prior distributions. All GP models in our experiments have a constant mean function and an ARD Matérn covariance function; the associated hyperparameters are estimated under a Bayesian approach. As proposed in snoek2012practical, for all algorithms except TS-UU, we use an averaged version of the acquisition function, obtained by first drawing 10 samples of the GP hyperparameters, computing the acquisition function conditioned on each of these hyperparameters, and then averaging the results; for TS-UU, a single sample of the GP hyperparameters is used.

In all problems and for each replication, we draw one sample, $\theta^*$, from the prior distribution over $\theta$ to obtain a true underlying utility function, $U(\cdot\,;\theta^*)$, which is used to generate the preference information from the DM. The performance of the algorithms is reported with respect to this true underlying utility function.

Our code and experiments are available at https://github.com/RaulAstudillo06/BOPL.

### 5.1 Synthetic Test Functions

The first three problems use well known test functions drawn from the evolutionary multi-objective optimization literature (van1999multiobjective; deb2005scalable; knowles2006). We define these functions in detail in Appendix  E.

Results of these experiments are shown on a logarithmic scale in Figures 1, 2, and 3. In these three test problems, EI-UU and TS-UU substantially outperform Random and ParEGO. In the first and third problems, EI-UU outperforms TS-UU, whereas in the second problem the opposite occurs. Throughout these problems, EI-UU greatly benefits from preference information. TS-UU also benefits from preference information, especially in the first two problems.

#### DTLZ1a With a Linear Utility

A general form of this test function was first introduced in deb2005scalable. The version we use was defined in knowles2006; its attributes and domain are specified in Appendix E. In this experiment, we use a linear utility function, $U(y;\theta) = \theta^\top y$, and let the prior distribution on $\theta$ be uniform.

#### DTLZ2 With a Quadratic Utility

This function was first introduced in a general form in deb2005scalable. We use a concrete version of this function with four attributes defined over $[0,1]^5$. Here, we use a quadratic utility function, $U(y;\theta) = -\|y - \theta\|^2$, where $\theta$ is uniform over a set $\Theta$ consisting of 8 points lying in the Pareto front of $f$, obtained as

$$\Theta = \left\{ f(x) : x_i \in \left\{ \tfrac{i-1}{3},\, \tfrac{i}{3} \right\},\ i \leq 3,\ x_4 = x_5 = 0.5 \right\}.$$

We envision that, in practice, such a utility function could be used for finding designs with attributes as close as possible to an uncertain vector of “ideal” attributes, which could take a range of values depending on the type of DM in question.

#### VLMOP3 With an Exponential Utility

This test function first appeared in van1999multiobjective. It has three attributes; its domain is specified in Appendix E. Here, we use an exponential utility function

$$U(y;\theta) = \frac{1}{3} \sum_{j=1}^{3} \frac{1 - \exp(-\theta y_j)}{\theta},$$

and let the prior on $\theta$ be uniform over an interval of positive values (see Appendix E).

We note that, when $\theta \to 0^+$, the solution that maximizes $U(y;\theta)$ converges to the solution that maximizes the average of the attributes (risk neutrality), whereas when $\theta \to \infty$, it converges to the one that maximizes the worst attribute (worst-case risk). Therefore, if $y_1, y_2, y_3$ denote the outcomes of an event under plausible scenarios with known likelihoods (in the above example, three equally likely scenarios), this utility function provides a natural way to optimize the (expected) utility of a DM with uncertain (constant absolute) risk aversion with respect to this outcome.

### 5.2 Portfolio Simulation Optimization

In this test problem, we use our algorithm to tune the hyperparameters of a trading strategy so as to maximize the return of a DM with an unknown risk aversion tolerance. We envision this as modeling a financial advisor that has many clients, each of which requires customized financial planning based on their own portfolio, and has a different risk tolerance. Using choices made by past clients about which financial product they prefer, the financial advisor may form a probability distribution over utility functions to use when using a computationally expensive simulation to develop a menu of options to show a new client.

We use CVXPortfolio (cvxportfolio) to simulate and optimize the evolution of a portfolio over a period of four years, from Jan. 2012 through Dec. 2015, using open-source market data; the details of the simulation can be found in §7.1 of cvxportfolio. Here, $f$ has two outputs: the mean and (minus the) standard deviation of the daily returns. We use a non-standard utility function that sets $U(y;\theta)$ to $y_1$ if $y_2 \geq \theta$ and to $-\infty$ otherwise. This recovers the constrained optimization problem that maximizes $y_1$ subject to the constraint that $y_2 \geq \theta$. Analogous to the case of linear utility functions discussed in Proposition 2, it can be shown that for this class of utility functions, EI-UU admits an expression similar to that of the constrained expected improvement (gardner14).

Thus, in this setting we wish to maximize the average return subject to an unknown constraint on the DM’s risk tolerance level, $\theta$, which we assume is uniformly distributed (recall that $y_2$ is minus the standard deviation). The hyperparameters to be tuned are the trade, hold, and risk aversion parameters of the trading strategy. Results are shown in Figure 4. Here, the optimal solution is unknown, so we report the raw utility value instead. As before, EI-UU substantially outperforms Random and ParEGO, and is followed in performance by TS-UU. Both EI-UU and TS-UU benefit from preference information.

### 5.3 Optimization of Ambulance Bases

In this test problem, we optimize the locations of three ambulance bases according to the distribution of the response times. We consider $k = 5$ attributes, representing the numbers of response times falling within given intervals of time, and assume a DM considers these attributes to choose the ideal locations of the ambulance bases: we let $y_j$, $j = 1, \dots, 4$, be the number of response times falling within the $j$th of four consecutive intervals, and $y_5$ be the number of those falling within the final, open-ended interval. Because these attributes are positive counts, we model their logarithms as GPs instead of the attributes directly. We then use the utility function

$$U(y;\theta) = \sum_{j=1}^{5} \theta_j \frac{\exp(y_j)}{\sum_{i=1}^{5} \exp(y_i)},$$

which corresponds to a linear utility function over the fractions of response times within the various intervals. Here, the prior on $\theta$ is taken to be uniform over a fixed set of weight vectors.
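Since we model log-counts, this utility is a $\theta$-weighted softmax of the GP outputs; a small sketch (function name and max-shift are ours):

```python
import numpy as np

def ambulance_utility(y, theta):
    """U(y; theta) from Section 5.3: a linear function, with weights theta,
    of the softmax-normalized log response-time counts y."""
    w = np.exp(y - np.max(y))  # shift by max(y) for numerical stability
    return float(theta @ (w / w.sum()))

# The softmax terms are the fractions of response times in each interval,
# so equal weights always give a utility of exactly 1:
val = ambulance_utility(np.log([1.0, 2.0, 3.0, 4.0, 10.0]), np.ones(5))
```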

Results of this experiment are shown in Figure 5. As before, EI-UU substantially outperforms Random and ParEGO, and is followed by TS-UU. In contrast with all other test problems, however, here neither EI-UU nor TS-UU seems to benefit from preference information. A closer inspection of the data obtained from this experiment shows that there is a highly concentrated region of designs that yields a high utility value for a wide range of values of $\theta$, which explains this behavior. This also suggests that our sampling policies are able to find robust designs when they exist.

## 6 Conclusion

We introduced multi-attribute Bayesian optimization with preference learning, a novel approach for black-box global optimization of time-consuming-to-evaluate physical or computational experiments with multiple attributes that allows us to accommodate partial preference information in a principled way. By leveraging preference information, our approach is more efficient than multi-objective optimization approaches. By acknowledging uncertainty in the DM’s preferences, our approach is more flexible and robust than single-objective optimization approaches that use a point estimate of the DM’s utility function. Relevant directions for future work include developing more sophisticated policies for selecting the pairs of attributes to be shown to the DM, and using nonparametric models for estimating the DM’s utility function.

#### Acknowledgements

The authors were supported by NSF CMMI-1536895, NSF CMMI-1254298, AFOSR FA9550-15-1-0038, and AFOSR FA9550-19-1-0283. The authors would like to thank anonymous reviewers for their comments.


## Appendix A Unbiased Estimator of the Gradient of EI-UU

In this section we formally state and prove Proposition 1.

###### Proposition 1.

Suppose that $U(\cdot\,;\theta)$ is differentiable for all $\theta$, let $\mathbb{X}_0$ be an open subset of $\mathbb{X}$ such that $\mu_n$ and $C_n$ are differentiable on $\mathbb{X}_0$, and suppose there exists a measurable function $g$ satisfying

1. $\|\gamma(x,\theta,Z)\| \leq g(\theta,Z)$ for all $x \in \mathbb{X}_0$, $\theta$, and $Z$.

2. $\mathbb{E}[g(\theta,Z)] < \infty$, where $Z$ is a $k$-variate standard normal random vector independent of $\theta$, and the expectation is over both $\theta$ and $Z$.

Further, suppose that for almost every $\theta$ and $Z$ the set $\{x \in \mathbb{X}_0 : U(\mu_n(x)+C_n(x)Z;\theta) = U^*_n(f;\theta)\}$ is countable. Then, EI-UU is differentiable on $\mathbb{X}_0$ and its gradient, when it exists, is given by

$$\nabla\,\mathrm{EI\text{-}UU}(x) = \mathbb{E}\left[\gamma(x,\theta,Z)\right],$$

where the expectation is over $\theta$ and $Z$, and

$$\gamma(x,\theta,Z) = \begin{cases} \nabla U(\mu_n(x)+C_n(x)Z;\theta), & \text{if } U(\mu_n(x)+C_n(x)Z;\theta) > U^*_n(f;\theta),\\ 0, & \text{otherwise.} \end{cases}$$
###### Proof.

From the given hypotheses it follows that, for any fixed $\theta$ and $Z$, the function $x \mapsto U(\mu_n(x)+C_n(x)Z;\theta)$ is differentiable on $\mathbb{X}_0$. This in turn implies that the function $x \mapsto \{U(\mu_n(x)+C_n(x)Z;\theta) - U^*_n(f;\theta)\}^+$ is continuous on $\mathbb{X}_0$ and differentiable at every $x$ such that $U(\mu_n(x)+C_n(x)Z;\theta) \neq U^*_n(f;\theta)$, with gradient equal to $\gamma(x,\theta,Z)$. From our assumption that for almost every $\theta$ and $Z$ the set $\{x \in \mathbb{X}_0 : U(\mu_n(x)+C_n(x)Z;\theta) = U^*_n(f;\theta)\}$ is countable, it follows that for almost every $\theta$ and $Z$ this function is continuous on $\mathbb{X}_0$ and differentiable on all of $\mathbb{X}_0$, except maybe on a countable subset. Using this, along with conditions 1 and 2, and Theorem 1 in l1990unified, the desired result follows. ∎

We note that, if one imposes the stronger condition E[η(θ,Z)²]<∞, then γ(x,θ,Z) has finite second moment, and thus this unbiased estimator of ∇EI-UU(x) can be used within stochastic gradient ascent to find a stationary point of EI-UU (bottou1998online).
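As an illustration of how this estimator plugs into stochastic gradient ascent, consider the following sketch. Everything here is an illustrative stand-in (a scalar design, a toy posterior mean μn, a constant Cholesky factor Cn, a single fixed θ, linear utility, and U∗n set to 0), not the paper's actual GP posterior:

```python
import math
import random

# Toy stand-ins (illustrative only): scalar design x, two attributes,
# linear utility, and a constant incumbent utility U*_n = 0.
def mu_n(x):
    return [math.sin(x), math.cos(x)]

def grad_mu_n(x):
    return [math.cos(x), -math.sin(x)]

def C_n(x):
    # Lower-triangular Cholesky factor of K_n(x); constant in x here.
    return [[0.5, 0.0], [0.1, 0.3]]

THETA = [1.0, 0.5]   # in practice, sampled from the posterior over theta
U_STAR = 0.0

def gamma(x, z):
    """Unbiased gradient estimator: gradient of U(mu_n(x) + C_n(x) Z; theta)
    with respect to x when the improvement is positive, and 0 otherwise."""
    C = C_n(x)
    y = [mu_n(x)[i] + sum(C[i][j] * z[j] for j in range(2)) for i in range(2)]
    if sum(t * v for t, v in zip(THETA, y)) <= U_STAR:
        return 0.0
    # Chain rule; C_n does not depend on x in this toy example.
    return sum(t * g for t, g in zip(THETA, grad_mu_n(x)))

def sga(x0, steps=2000, lr=0.05, seed=0):
    """Stochastic gradient ascent on EI-UU using the estimator gamma."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, steps + 1):
        z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        x += (lr / math.sqrt(t)) * gamma(x, z)
    return x
```

Because the toy posterior covariance does not depend on x, the ascent converges to the maximizer of θ⊤μn(x), namely x = arctan 2 ≈ 1.107.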

## Appendix B Computation of EI-UU and Its Gradient When U Is Linear

In this section we formally state and prove Propositions 2 and 3.

###### Proposition 2.

Suppose that U(y;θ)=θ⊤y for all y and θ. Then,

 EI-UU(x)=En[Δn(x;θ)Φ(Δn(x;θ)/σn(x;θ))+σn(x;θ)φ(Δn(x;θ)/σn(x;θ))]

where the expectation is over θ, Δn(x;θ)=θ⊤μn(x)−U∗n(f;θ), σn(x;θ)=√(θ⊤Kn(x)θ), and φ and Φ are the standard normal probability density function and cumulative distribution function, respectively.

###### Proof.

Note that

 EI-UU(x)=En[En[{θ⊤f(x)−U∗n(f;θ)}+∣θ]].

Thus, it suffices to show that

 En[{θ⊤f(x)−U∗n(f;θ)}+∣θ]=Δn(x;θ)Φ(Δn(x;θ)/σn(x;θ))+σn(x;θ)φ(Δn(x;θ)/σn(x;θ)),

but this can be easily verified by noting that, conditioned on θ, the time-n posterior distribution of θ⊤f(x) is normal with mean θ⊤μn(x) and variance θ⊤Kn(x)θ. ∎
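The closed form above is straightforward to evaluate by averaging over posterior samples of θ. A minimal sketch (the function name and the representation of U∗n via a callable are illustrative):

```python
import math

def norm_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def norm_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def ei_uu_linear(mu, K, u_star, thetas):
    """EI-UU(x) for linear utility U(y; theta) = theta^T y, averaging the
    closed form over posterior samples of theta.
    mu: posterior mean vector mu_n(x); K: posterior covariance K_n(x);
    u_star(theta): incumbent utility U*_n(f; theta)."""
    m = len(mu)
    total = 0.0
    for th in thetas:
        mean = sum(t * v for t, v in zip(th, mu))
        var = sum(th[i] * K[i][j] * th[j] for i in range(m) for j in range(m))
        sigma = math.sqrt(max(var, 1e-12))
        delta = mean - u_star(th)
        u = delta / sigma
        total += delta * norm_cdf(u) + sigma * norm_pdf(u)
    return total / len(thetas)
```

With a single θ = (1, 1), μn(x) = 0, Kn(x) = I, and U∗n = 0, this reduces to σφ(0) = √2·φ(0) = 1/√π.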

###### Proposition 3.

Suppose that U(y;θ)=θ⊤y for all y and θ, that μn and Kn are differentiable, and that there exists a function η satisfying

1. ∥∇En[{θ⊤f(x)−U∗n(f;θ)}+∣θ]∥≤η(θ) for all x and θ.

2. En[η(θ)]<∞.

Then, EI-UU is differentiable and its gradient is given by

 ∇EI-UU(x)=En[(θ⊤∇μn(x))Φ(Δn(x;θ)/σn(x;θ))+φ(Δn(x;θ)/σn(x;θ))/(2σn(x;θ))∑_{i,j=1}^{m}θiθj∇Kn(x)i,j].
###### Proof.

Recall that

 En[{θ⊤f(x)−U∗n(f;θ)}+∣θ]=Δn(x;θ)Φ(Δn(x;θ)/σn(x;θ))+σn(x;θ)φ(Δn(x;θ)/σn(x;θ)).

Moreover, standard calculations show that

 ∇[Δn(x;θ)Φ(Δn(x;θ)/σn(x;θ))]=(θ⊤∇μn(x))Φ(Δn(x;θ)/σn(x;θ))+Δn(x;θ)φ(Δn(x;θ)/σn(x;θ))∇[Δn(x;θ)/σn(x;θ)],

and

 ∇[σn(x;θ)φ(Δn(x;θ)/σn(x;θ))]
 =φ(Δn(x;θ)/σn(x;θ))/(2σn(x;θ))∑_{i,j=1}^{m}θiθj∇Kn(x)i,j+σn(x;θ)[−(Δn(x;θ)/σn(x;θ))φ(Δn(x;θ)/σn(x;θ))∇[Δn(x;θ)/σn(x;θ)]]
 =φ(Δn(x;θ)/σn(x;θ))/(2σn(x;θ))∑_{i,j=1}^{m}θiθj∇Kn(x)i,j−Δn(x;θ)φ(Δn(x;θ)/σn(x;θ))∇[Δn(x;θ)/σn(x;θ)].

Thus, En[{θ⊤f(x)−U∗n(f;θ)}+∣θ] is a differentiable function of x, and its gradient is given by

 ∇En[{θ⊤f(x)−U∗n(f;θ)}+∣θ]=(θ⊤∇μn(x))Φ(Δn(x;θ)/σn(x;θ))+φ(Δn(x;θ)/σn(x;θ))/(2σn(x;θ))∑_{i,j=1}^{m}θiθj∇Kn(x)i,j.

From conditions 1 and 2, and Theorem 16.8 in billingsley1995probability, it follows that EI-UU is differentiable and its gradient is given by

 ∇EI-UU(x)=En[∇En[{θ⊤f(x)−U∗n(f;θ)}+∣θ]]

i.e.,

 ∇EI-UU(x)=En[(θ⊤∇μn(x))Φ(Δn(x;θ)/σn(x;θ))+φ(Δn(x;θ)/σn(x;θ))/(2σn(x;θ))∑_{i,j=1}^{m}θiθj∇Kn(x)i,j].

We end by noting that if Θ is compact and μn and Kn are both continuously differentiable, then

 (θ,x)↦∥(θ⊤∇μn(x))Φ(Δn(x;θ)/σn(x;θ))+φ(Δn(x;θ)/σn(x;θ))/(2σn(x;θ))∑_{i,j=1}^{m}θiθj∇Kn(x)i,j∥

is continuous and thus attains its maximum value on Θ×X (recall that X is compact as well). Thus, in this case conditions 1 and 2 are satisfied by the constant function

 η≡max_{(θ,x)∈Θ×X}∥(θ⊤∇μn(x))Φ(Δn(x;θ)/σn(x;θ))+φ(Δn(x;θ)/σn(x;θ))/(2σn(x;θ))∑_{i,j=1}^{m}θiθj∇Kn(x)i,j∥.
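The gradient formula can be checked numerically against a central finite difference of the closed-form EI-UU. The sketch below uses a one-dimensional design, two attributes, and a single fixed θ; all functions are illustrative stand-ins rather than the paper's GP posterior:

```python
import math

def norm_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def norm_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

# Illustrative toy posterior: scalar design x, two attributes, fixed theta.
THETA = [1.0, 0.5]
U_STAR = 0.2

def mu(x):  return [math.sin(x), x * x]
def dmu(x): return [math.cos(x), 2.0 * x]
def K(x):   return [[1.0 + x * x, 0.0], [0.0, 2.0]]
def dK(x):  return [[2.0 * x, 0.0], [0.0, 0.0]]

def delta_sigma(x):
    mean = sum(t * v for t, v in zip(THETA, mu(x)))
    var = sum(THETA[i] * K(x)[i][j] * THETA[j]
              for i in range(2) for j in range(2))
    return mean - U_STAR, math.sqrt(var)

def ei(x):
    d, s = delta_sigma(x)
    return d * norm_cdf(d / s) + s * norm_pdf(d / s)

def grad_ei(x):
    # Analytic gradient: (theta^T dmu) Phi(d/s) + pdf(d/s)/(2s) * sum_ij theta_i theta_j dK_ij
    d, s = delta_sigma(x)
    term1 = sum(t * g for t, g in zip(THETA, dmu(x))) * norm_cdf(d / s)
    term2 = norm_pdf(d / s) / (2.0 * s) * sum(
        THETA[i] * THETA[j] * dK(x)[i][j] for i in range(2) for j in range(2))
    return term1 + term2

# Central finite-difference check at x0 = 0.7.
h = 1e-6
x0 = 0.7
fd = (ei(x0 + h) - ei(x0 - h)) / (2.0 * h)
```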

## Appendix C Thompson Sampling Under Utility Uncertainty (TS-UU)

Thompson sampling under utility uncertainty (TS-UU) generalizes the well-known Thompson sampling method (thompson1933likelihood) to our setting. TS-UU works as follows. It first samples θ from its posterior distribution. Then, it samples f from its Gaussian process posterior distribution. The point at which it evaluates next is the one that maximizes U(f(x);θ) for the samples of θ and f. This contrasts with the point-estimate approach in that it samples θ from its posterior rather than simply setting it equal to a point estimate. For example, if we implemented this point-estimate approach using standard Thompson sampling, we would sample only f from its posterior and then optimize U(f(x);θ̂), where θ̂ is a point estimate of θ, such as the maximum a posteriori estimate. TS-UU can induce substantially more exploration than this more classical approach.

TS-UU can be implemented by sampling f over a grid of points if X is low-dimensional. It can also be implemented for higher-dimensional X by optimizing with a method for continuous nonlinear optimization (such as CMA-ES, hansen2016cma), lazily sampling f from the posterior at each new point that CMA-ES wants to evaluate, conditioning on previous real and sampled evaluations. We use the latter approach in our numerical experiments.
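A grid-based TS-UU iteration can be sketched as follows. The linear utility and the sampler interfaces are illustrative; in the lazy-sampling variant, the posterior sample of f would instead be queried point by point as CMA-ES proposes candidates:

```python
import math
import random

def ts_uu_step(grid, sample_theta, sample_f, rng):
    """One TS-UU iteration: draw theta from its posterior, draw one joint
    posterior sample of f on a grid of designs, and return the grid point
    maximizing U(f(x); theta), here with linear utility theta^T f(x)."""
    theta = sample_theta(rng)
    fs = sample_f(grid, rng)  # list of attribute vectors, one per grid point
    best_x, best_val = None, -math.inf
    for x, y in zip(grid, fs):
        val = sum(t * v for t, v in zip(theta, y))
        if val > best_val:
            best_x, best_val = x, val
    return best_x
```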

## Appendix D Exploration and Exploitation Trade-Off

One of the key properties of the classical expected improvement acquisition function is that it is increasing with respect to both the posterior mean and variance. This means that it prefers to sample points that are either promising with respect to our current knowledge or are still highly uncertain, an appealing property for a sampling policy aiming to balance exploitation and exploration. The following result shows that, under certain conditions, the EI-UU sampling policy satisfies an analogous property.
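For the classical expected improvement, this monotonicity can be seen directly from its closed form EI(Δ, σ) = ΔΦ(Δ/σ) + σφ(Δ/σ), whose partial derivatives are Φ(Δ/σ) > 0 in Δ and φ(Δ/σ) > 0 in σ. A quick numerical illustration (the function name is ours):

```python
import math

def ei_classical(delta, sigma):
    """Classical expected improvement as a function of the posterior mean
    improvement delta and posterior standard deviation sigma."""
    u = delta / sigma
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return delta * cdf + sigma * pdf
```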

###### Proposition 4.

Suppose that U(·;θ) is convex and non-decreasing for every θ. Also suppose x, x′∈X are such that μn(x)≥μn(x′) and Kn(x)⪰Kn(x′), where the first inequality is coordinate-wise and ⪰ denotes the partial order defined by the cone of positive semi-definite matrices. Then,

 EI-UUn(x)≥EI-UUn(x′).
###### Proof.

Since Kn(x)⪰Kn(x′), under the time-n posterior we have that f(x) is equal in distribution to f(x′)+(μn(x)−μn(x′))+W, where W is an m-variate normal random vector with zero mean and covariance matrix Kn(x)−Kn(x′), independent of f(x′). Thus,

 En[{U(f(x);θ)−U∗n(f;θ)}+∣θ]
 =En[{U(f(x′)+(μn(x)−μn(x′))+W;θ)−U∗n(f;θ)}+∣θ]
 ≥En[{U(f(x′)+W;θ)−U∗n(f;θ)}+∣θ]
 =En[En[{U(f(x′)+W;θ)−U∗n(f;θ)}+∣θ,f(x′)]]
 ≥En[{U(f(x′);θ)−U∗n(f;θ)}+∣θ],

where the first inequality follows because U(·;θ) is non-decreasing and μn(x)−μn(x′)≥0, and the second because U(·;θ) is convex, via Jensen's inequality applied to the inner conditional expectation. Finally, taking expectations with respect to θ yields the desired result. ∎

This result implies, for example, that for linear utility functions, the EI-UU sampling policy exhibits the behavior described above. We also note, however, that most utility functions used in practice are concave instead of convex.

## Appendix E Synthetic Test Functions Definitions

### E.1 DTLZ1a

A general form of this test function was first introduced in deb2005scalable. The version we use was defined in knowles2006. It is defined over X=[0,1]^6 and has two attributes given by

 f1(x)=−0.5x1(1+g(x)),
 f2(x)=−0.5(1−x1)(1+g(x)),

where

 g(x)=100(5+∑_{i=2}^{6}[(xi−0.5)^2−cos(2π(xi−0.5))]).

The Pareto-optimal set of designs consists of those x such that xi=0.5 for i=2,…,6, and x1 may take any value in [0,1]. The Pareto front is a segment of the hyperplane f1+f2=−0.5.
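A direct transcription of this definition in Python (maximization sign convention as above; the function name is illustrative):

```python
import math

def dtlz1a(x):
    """DTLZ1a attributes for x in [0,1]^6 (negated, since we maximize)."""
    g = 100.0 * (5.0 + sum((xi - 0.5) ** 2 - math.cos(2.0 * math.pi * (xi - 0.5))
                           for xi in x[1:6]))
    f1 = -0.5 * x[0] * (1.0 + g)
    f2 = -0.5 * (1.0 - x[0]) * (1.0 + g)
    return [f1, f2]
```

On the Pareto-optimal set (x2 = … = x6 = 0.5), g vanishes and f1 + f2 = −0.5.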

### E.2 DTLZ2

This function was first introduced in a general form in deb2005scalable. In our experiment, we use a concrete version of it with four attributes defined over X=[0,1]^5. The attributes are

 f1(x)=−(1+g(x))cos((π/2)x1)cos((π/2)x2)cos((π/2)x3),
 f2(x)=−(1+g(x))cos((π/2)x1)cos((π/2)x2)sin((π/2)x3),
 f3(x)=−(1+g(x))cos((π/2)x1)sin((π/2)x2),
 f4(x)=−(1+g(x))sin((π/2)x1),

where

 g(x)=∑_{i=4}^{5}(xi−0.5)^2.

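A direct transcription in Python (maximization sign convention as above; the distance function g(x) = Σ_{i=4}^{5}(xi − 0.5)^2 and the f2 pattern follow the standard DTLZ2 definition, so they are stated here as assumptions):

```python
import math

def dtlz2(x):
    """Four-attribute DTLZ2 variant for x in [0,1]^5 (negated, since we maximize)."""
    g = sum((xi - 0.5) ** 2 for xi in x[3:5])
    c = [math.cos(0.5 * math.pi * xi) for xi in x[:3]]
    s = [math.sin(0.5 * math.pi * xi) for xi in x[:3]]
    return [-(1.0 + g) * c[0] * c[1] * c[2],
            -(1.0 + g) * c[0] * c[1] * s[2],
            -(1.0 + g) * c[0] * s[1],
            -(1.0 + g) * s[0]]
```

A convenient sanity check is the identity f1² + f2² + f3² + f4² = (1 + g)².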

### E.3 VLMOP3

This test function first appeared in van1999multiobjective. It is defined over X=[−3,3]^2 and has three attributes given by

 f1(x)=−0.5(x1^2+x2^2)−sin(x1^2+x2^2),
 f2(x)=−(3x1−2x2+4)^2/8−(x1−x2+1)^2/27−15,
 f3(x)=−1/(x1^2+x2^2+1)+1.1exp(−x1^2−x2^2).
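A direct transcription in Python (maximization sign convention as above; the function name is illustrative):

```python
import math

def vlmop3(x1, x2):
    """VLMOP3 attributes for (x1, x2) in [-3, 3]^2 (negated, since we maximize)."""
    r2 = x1 * x1 + x2 * x2
    f1 = -0.5 * r2 - math.sin(r2)
    f2 = -((3.0 * x1 - 2.0 * x2 + 4.0) ** 2) / 8.0 \
         - ((x1 - x2 + 1.0) ** 2) / 27.0 - 15.0
    f3 = -1.0 / (r2 + 1.0) + 1.1 * math.exp(-r2)
    return [f1, f2, f3]
```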
\putbib

### Footnotes

1. denotes the cone of positive definite matrices.