1 Introduction

Abstract

We propose a novel, theoretically-grounded, acquisition function for batch Bayesian optimization informed by insights from distributionally ambiguous optimization. Our acquisition function is a lower bound on the well-known Expected Improvement function – which requires a multi-dimensional Gaussian Expectation over a piecewise affine function – and is computed by evaluating instead the best-case expectation over all probability distributions consistent with the same mean and variance as the original Gaussian distribution. Unlike alternative approaches including Expected Improvement, our proposed acquisition function avoids multi-dimensional integrations entirely, and can be computed exactly as the solution of a convex optimization problem in the form of a tractable semidefinite program (SDP). Moreover, we prove that the solution of this SDP also yields exact numerical derivatives, which enable efficient optimization of the acquisition function. Finally, it efficiently handles marginalized posteriors with respect to the Gaussian Process’ hyperparameters. We demonstrate superior performance to heuristic alternatives and approximations of the intractable expected improvement, justifying this performance difference based on simple examples that break the assumptions of state-of-the-art methods.

1 Introduction

When dealing with numerical optimization problems in engineering applications, one is often faced with the optimization of an expensive process that depends on a number of tuning parameters. Examples include the outcome of a biological experiment, the training of large scale machine learning algorithms or the outcome of exploratory drilling. We are concerned with problem instances wherein there is the capacity to perform experiments in parallel and, if needed, repeatedly select further batches with cardinality as part of some sequential decision making process. Given the cost of the process, we wish to select the parameters at each stage of evaluations carefully so as to optimize the process using as few experimental evaluations as possible.

2 Bayesian optimization

It is common to assume a surrogate model for the outcome of the process to be optimized. This model, which is built based on both prior assumptions and past function evaluations, is used to determine a collection of input points for the next set of evaluations. Bayesian Optimization provides an elegant surrogate model approach and has been shown to outperform other state-of-the-art algorithms on a number of challenging benchmark functions (Jones, 2001). It models the unknown function with a Gaussian Process (GP) (Rasmussen and Williams, 2005), a probabilistic function approximator which can incorporate prior knowledge such as smoothness, trends, etc.

A comprehensive introduction to GPs can be found in (Rasmussen and Williams, 2005). In short, modeling a function with a GP amounts to modelling the function as a realization of a stochastic process. In particular, we assume that the outcomes of function evaluations are normally distributed random variables with known prior mean function and prior covariance function . Prior knowledge about , such as smoothness and trends, can be incorporated through judicious choice of the covariance function , while the mean function is commonly assumed to be zero. A training dataset of past function evaluations for , with is then used to calculate the posterior of .

The GP regression equations (Rasmussen and Williams, 2005) give this posterior on a batch of test locations as

 y|D∼N(μ(X),Σ(X)) (1)

with

 μ(X)= K(Xd,X)TK(Xd,Xd)−1yd, (2) Σ(X)= K(X,X)− K(Xd,X)TK(Xd,Xd)−1K(Xd,X) (3)

where is a matrix containing the prior covariances between every pair of training and test points generated by (similarly for ). The posterior mean value and variance depend also on the dataset , but we do not explicitly indicate this dependence in order to simplify the notation. Likewise, the posterior is a normally distributed random variable whose mean and covariance depend on , but we do not make this explicit.

Given the surrogate model, we wish to identify some selection criterion for choosing the next batch of points to be evaluated. Such a selection criterion is known as an acquisition function in the terminology of Bayesian Optimization. Ideally, such an acquisition function would take into account the number of remaining evaluations that we can afford, e.g. by computing a solution via dynamic programming to construct an optimal sequence of policies for future batch selections. However, a probabilistic treatment of such a criterion is computationally intractable, involving multiple nested minimization-marginalization steps (GonzÃ¡lez et al., 2016).

To avoid this computational complexity, myopic acquisition functions that only consider the one-step return are typically used instead. For example, one could choose to minimize the one-step Expected Improvement (described more fully in §3) over the best evaluation observed so far, or maximize the probability of having an improvement in the next batch over the best evaluation. Other criteria use ideas from the bandit (Desautels et al., 2014) and information theory (Shah and Ghahramani, 2015) literature. In other words, the intractability of the multistep lookahead problem has spurred instead the introduction of a wide variety of myopic heuristics for batch selection.

3 Expected improvement

We will focus on the (one-step) Expected Improvement criterion, which is a standard choice and has been shown to achieve good results in practice (Snoek et al., 2012). In order to give a formal description we first require some definitions related to the optimization procedure of the original process. At each step of the optimization define as the vector of past function values evaluated at the points , and as a candidate set of points for the next batch of evaluations. Then the expected improvement acquisition function is defined as

 α(X)=E[min(y1,…,yk,yd–––)|D]−yd–––withy|D∼N(μ(X),Σ(X)) (4)

where is the element-wise minimum of , i.e. the minimum value of the function achieved so far by any known function input. Selection of a batch of points to be evaluated with optimal expected improvement then amounts to finding some

Unfortunately, direct evaluation of the acquisition function requires the –dimensional integration of a piecewise affine function; this is potentially a computationally expensive operation. This is particularly problematic for gradient-based optimization methods, wherein may be evaluated many times when searching for a minimizer. Regardless of the optimization method used, such a minimizer must also be computed again for every step in the original optimization process, i.e. every time a new batch of points is selected for evaluation. Therefore a tractable acquisition function should be used. In contrast to (4), the acquisition function we will introduce in section 4 avoids expensive integrations and can be calculated efficiently with standard software tools.

Despite these issues, Chevalier and Ginsbourger (2013) presented an efficient way of approximating and its derivative (Marmin et al., 2015) by decomposing it into a sum of –dimensional Gaussian Cumulative Distributions, which can be calculated efficiently using the seminal work of Genz and Bretz (2009). There are two issues with this approach: First the number of calls to the –dimensional Gaussian Cumulative Distribution grows quadratically with respect to the batch size , and secondly, there are no guarantees about the accuracy of the approximation or its gradient. Indeed, approximations of the multi-point expected improvement, as calculated with the R package DiceOptim (Roustant et al., 2012) can be shown to be arbitrarily wrong in trivial low-dimensional examples (see Figure 3). To avoid these issues, Gonzalez et al. (2016) and Ginsbourger et al. (2009) rely on heuristics to derive a multi-point criterion. Both methods choose the batch points in a greedy, sequential way, which restricts them from exploiting the interactions between the batch points in a probabilistic manner.

4 Distributionally ambiguous optimization for Bayesian optimization

We now proceed to the main contribution of the paper, which draws ideas from the Distributionally Ambiguous Optimization community to derive a novel, tractable acquisition function which lower bounds the expectation in (4). Our acquisition function:

• is theoretically grounded;

• is numerically reliable and consistent, unlike Expected Improvement-based alternatives (see §6);

• is fast and scales well with the batch size;

• provides gradients as a direct by-product of the function evaluation; and

• handles efficiently marginalized posteriors with respect to the GP’s hyperparameters

In particular, we use the posterior result (1)–(2) derived from the GP to determine the mean and variance of given a candidate batch selection , but we thereafter ignore the Gaussian assumption and consider only that has a distribution embedded within a family of distributions that share the mean and covariance calculated by (2) and (2). In other words, we define

 P(μ,Σ)=\setPEP[ξ]=μ,EP[ξξT]=Σ.

We denote the set simply as where the context is clear. Note in particular that for any choice of mean or covariance .

One can then construct upper and lower bound for the Expected Improvement by maximizing or minimizing over the set , i.e. by writing

 infP∈PEP[g(ξ)]≤EN(μ,Σ)[g(ξ)]≤supP∈PEP[g(ξ)] (5)

where the random vector and the function are chosen such that , i.e. and

 g(ξ)=min(ξ1,…,ξk,yd–––)−yd–––. (6)

Observe that the middle term in (5) is equivalent to the expectation in (4).

Perhaps surprisingly, both of the bounds in (5) are computationally tractable even though they seemingly require optimization over the infinite-dimensional (but convex) set of distributions . For either case, these bounds can be computed exactly via transformation of the problem to a tractable, convex semidefinite optimization problem use distributionally ambiguous optimization techniques (Zymler et al., 2013).

This computational tractability comes at the cost of inexactness in the bounds (5), which is a consequence of minimizing (or maximizing) over a set of distributions containing the Gaussian distribution as just one of its members. We show in §6 that this inexactness is of limited consequence in practice. Nevertheless, there remains significant scope for tightening the bounds in (5) via imposition of additional convex constraints on the set , e.g. by restricting to symmetric or unimodal distributions (Parys et al., 2015). Most of the results in this work would still apply, mutatis mutandis, if such structural constraints were included.

In contrast, the distributionally ambiguity is useful for integrating out the uncertainty of the GP’s hyperparameters efficiently. To see this, first define the second order moment matrix of the posterior as

 Ω\eqdef[Σ+μμTμμT1], (7)

which we will occasionally write as to highlight the dependency of the second order moment matrix on . Given samples of the hyperparameters, resulting in second order moment matrices , we can estimate the resulting second moment matrix of the marginalized, non-Gaussian, posterior as

 ¯Ω≈1qq∑i=1Ωi

Due to the distributionally ambiguity of our approach, both bounds of (5) can be calculated directly based on , hence avoiding multiple calls to the acquisition function.

We will focus on the lower bound in (5), hence adopting an optimistic modelling approach.1 Informally, we are optimistically assuming that the distribution of function values gives as low a function value as is compatible with the mean and covariance (computed by the GP).

The following result says that the lower (i.e. optimistic) bound in (5) can be computed via the solution of a convex optimization problem whose objective function is linear in :

{theorem}

The optimal value of the semi-infinite optimization problem

 infP∈PEP[g(ξ)]

coincides with the optimal value of the following semidefinite program:

 p(Ω)\eqdef sup ⟨Ω,M⟩−yd––– (P) s.t. M−Ci⪯0,i=0,…,k,

where is the decision variable and

 C0 \eqdef[000Tyd–––], (8) Ci \eqdef[0ei/2eTi/20],i=1,…,k

are auxiliary matrices defined using and the standard basis vectors in .

Proof.

See Appendix A. ∎

Problem () is a semidefinite program (SDP). SDPs are convex optimization problems with a linear objective and convex conic constraints (i.e. constraints over the set of symmetric matrices , positive semidefinite/definite matrices /). Hence it can be solved to global optimality with standard software tools (Sturm, 1999; O’Donoghue et al., 2016a). We therefore propose the computationally tractable acquisition function

 ¯α(X)\eqdefp(Ω(X))≤α(X)∀X∈Rk×n,

which is an optimistic variant of the Expected Improvement function in (4).

Although the value of for any fixed is computable via solution of an SDP, the complex dependence (2)–(2) that defines the mapping means that is still non-convex in . This is unfortunate, since we ultimately wish to minimize in order to identify the next batch of points to be evaluated experimentally.

However, it is still possible to compute a local solution via local optimization if an appropriate gradient can be computed. The next result establishes that this is always possible provided that : {theorem} is differentiable on with , where is the unique optimal solution of () at .

The preceding result shows that is produced as a byproduct of evaluation of , since it is simply the unique optimizer of (). Hence we can evaluate via using the optimal solution and application of the chain rule, i.e.

 ∂¯α(X)∂Xi,j=\innerprod∂p(Ω)∂Ω∂Ω(X)∂X(i,j)=\innerprod¯M(Ω)∂Ω(X)∂X(i,j)

Note that the second term in the rightmost inner product above depends on the particular choice of covariance function . This computation is standard for many choices of covariance function, and many standard software tools, including GPy and GPflow, provide the means to compute efficiently given .

We are now in a position to present an algorithm that summarizes our proposed Bayesian Optimisation method using our proposed acquisition function. In the sequel we will present timing and numerical results to demonstrate the validity of this algorithm.

5 Timing

Our acquisition function is evaluated via solution of a semidefinite program, and as such it benefits from the huge advances of the convex optimization field. A variety of standard tools exist for solving such problems accurately, such as the commercial second-order solver MOSEK (MOSEK, ). However, although second order solvers require few iterations and give very accurate results they are not applicable on very large problems (Boyd and Vandenberghe, 2004). On the other hand, first-order solvers, such as SCS (O’Donoghue et al., 2016b) or CDCS (Zheng et al., 2017), scale well with the problem size, are amenable to parallelism, and implementations on graph-based GPU acceleration frameworks have recently been suggested (Wytock et al., 2016). They also allow for solver warm-starting between acquisition function evaluations, which results in significant speedup for our case since we repeatedly evaluate the acquisition function at nearby points to perform gradient-based optimization. In total, for the challenging optimization of the 5-d Alpine-1 function, the calculation of OEI and its derivative is (on average) 10, 100 and 1000 times faster than DiceOptim’s QEI derivative for batch sizes of 2, 7, and 16 respectively, when using SCS as a solver with warm starting.

6 Empirical analysis

In this section we demonstrate the effectiveness of our acquisition function against a number of state-of-the-art alternatives. The acquisition functions we consider are listed in Table 1. In the implementation of our own acquisition function, OEI, we produce both acquisition function values and gradients by solving the semidefinite program () using the solver MOSEK  (MOSEK, ) accessed through the Python-based convex optimization package CVXPY (Diamond and Boyd, 2016).

We show that OEI achieves better performance than alternatives and highlight simple failure cases exhibited by competing methods. In making the following comparisons, extra care should be given in the setup used. This is because Bayesian Optimisation is a multifaceted procedure that depends on a collection of disparate elements (e.g. kernel/mean function choice, normalization of data, acquisition function, optimization of the acquisition function) each of which can have a considerable effect on the resulting performance (Snoek et al., 2012). For this reason we test the different algorithms on a unified testing framework, based on GPflow, which will be released upon acceptance, where all the elements are the same unless stated otherwise.

We first demonstrate that the competing Expected Improvement based algorithms produce clearly suboptimal choices in a simple 1-dimensional example. We consider an 1-d Gaussian Process on the interval with a squared exponential kernel (Rasmussen and Williams, 2005) of lengthscale , variance , noise and a mean function . An example posterior of 10 observations is depicted in Figure 1. Given the GP and the 10 observations, we depict the optimal 3-point batch chosen by minimizing each acquisition function. Note that in this case we assume perfect modelling assumptions: the GP is completely known and representative of the actual process. We can observe in Figure 1 that the OEI choice is very close to the one proposed by QEI while being slightly more explorative, as OEI allows for the possibility of more exotic distributions than the Gaussian.

In contrast the LP-EI heuristic samples three times almost at the same point. This can be explained as follows: LP-EI is based on a Lipschitz criterion to penalize areas around previous choices. However, the Lipschitz constant for this function is dominated by the end points of the function (due to the quadratic trend), which is clearly not suitable for the area of the minimizer (around zero), where the Lipschitz constant is approximately zero. On the other hand, QEI-CL favors suboptimal regions. This is because QEI-CL postulates outputs equal to the mean value of the observations which significantly alter the GP posterior.

Testing the algorithms on 200 different posteriors, generated by creating sets of 10 observations drawn from the previously defined GP, suggests OEI as the clear winner. For each of the 200 posteriors, each algorithm chooses a batch, the performance of which is evaluated by sampling the multipoint expected improvement (4). The averaged results are depicted in Figure 2. For a batch size of 1 all of the algorithms perform the same, except for OEI which performs slightly worse due to the relaxation of Gaussianity. For batch sizes 2-3, QEI is the best strategy, while OEI is a very close second. For batch sizes 4-5 OEI performs significantly better. The deterioration of the performance for QEI in batch sizes 4 and 5 can be explained in Figure 3. In particular, after batch size 3, the calculation of QEI via the R package DiceOptim can become arbitrarily bad, leading to poor choices. A bug report will be submitted to the authors of DiceOptim after the end of the double-blind review process. Figure 3 also explains the very good performance of OEI. Although always different from the sampled estimate, it is reliable and closely approximates the actual expected improvement in the sense that their optimizers and level sets are in close agreement.

We next evaluate the performance of OEI in minimizing synthetic benchmark functions. We consider a mixture of cosines defined in , the Branin-Hoo function, defined on , the Six-Hump Camel function defined on and the Hartman 6 dimensional function defined on . We compare the performance of OEI against QEI, LP-EI and BLCB as well as random uniform sampling. The initial dataset consists of 5 and 20 random points for the 2d and 6d functions respectively, while the batch size is equal to 5 for all the experiments.

A squared exponential kernel with automatic relevance determination is used for the GP modelling (Rasmussen and Williams, 2005). Since none of the competitor algorithms can efficiently handle uncertainty in the hyperparameters, point estimates acquired by maximum likelihood are used for them. In contrast, the distributional ambiguity of OEI allows efficient sampling of the GP posterior to integrate out the uncertainty of hyperparameters, as explained in §4. We choose a Gamma Prior of shape and scale for the lengthscales and a Gaussian Prior with for the kernel’s variance. We then draw samples from the posterior of the hyperparameters given the data via Hamiltonian Monte Carlo. For the purpose of generality, the input domain of every function is scaled to while is scaled at each iteration, such as . The same scalings are applied for QEI, LP-EI and BLCB for reasons of consistency. The acquisition functions are optimized with the quasi-Newton L-BFGS-B algorithm (Fletcher, 1987) with 20 random restarts.

The results are presented in Figure 4. All of the methods except BLCB managed to identify the global minimum of the three two-dimensional functions (Cosines, Branin-Hoo, Six-Hump Camel). In these functions, LP-EI, QEI and OEI perform similarly. However, LP-EI fails to run in the branin function for most of the tested seeds due to an error in the computation of the gradients of the acquisition function. A bug report will be submitted to the authors of GPyOpt (the official package for LP-EI) after the end of the double-blind review process. But, most importantly, in the challenging case of the 6-dimensional Hartman function our algorithm achieves the best performance, with BLCB also demonstrating good performance in the longer run. Note that a log transformation is applied to the Hartman function, following Jones et al. (1998).

7 Conclusions

We have introduced a new acquisition function that is a tractable, probabilistic relaxation of the multi-point Expected Improvement, drawing ideas from the Distributionally Ambiguous Optimization community. Our acquisition function can be computed exactly via the solution of a semidefinite program, with its gradient available as a direct byproduct of this solution. In contrast to alternative approaches, our proposed acquisition function scales well with batch size, does not rely on heuristics (which we demonstrate can fail even in simple cases) and directly considers the correlation between batch points, achieving performance close to the actual multi-point Expected Improvement while remaining computationally tractable. Finally, its distributionally ambiguous approach allows for an effient handling of marginalized posterios with respect to the GP’s hyperparameters. Future work includes exploiting related results from the Optimization community: sensitivity analysis on Semidefinite Programming (Freund and Jarre, 2004) can provide the Hessian of the acquisition function allowing a Newton-based method to be used in the challenging problem of minimizing the high-dimensional acquisition function. Moreover, all the work presented here can be applied to Student t-processes (Shah et al., 2014), the results of which will explored in a subsequent publication.

Acknowledgments

This work was supported by the EPSRC AIMS CDT grant EP/L015987/1 and Schlumberger.

Appendix A Value of the Expected Improvement Lower Bound

In this section we provide a proof of the first of our main results, Theorem 4, which establishes that one can compute the value of our optimistic lower bound function

 infP∈P(μ,Σ)EP[g(ξ)] (9)

via solution of a convex optimization problem in the form of a semidefinite program.

Proof of Theorem 4:

Recall that the set is the set of all distributions with mean and covariance . Following the approach of Zymler et al. (2013), we first remodel problem (9) as:

 infν∈M+ ∫Rkg(ξ)ν(dξ) (10) s.t. ∫Rkν(dξ)=1 ∫Rkξν(dξ)=μ ∫RkξξTν(dξ)=Σ+μμT,

where represents the cone of nonnegative Borel measures on . The optimization problem (10) is a semi-infinite linear program, with infinite dimensional decision variable and a finite collection of linear equalities in the form of moment constraints.

As shown by Zymler et al. (2013), the dual of problem (10) has instead a finite dimensional set of decision variables and an infinite collection of constraints, and can be written as

 sup ⟨Ω,M⟩ (11) s.t. [ξT1]M[ξT1]T≤g(ξ)∀ξ∈Rk,

with the decision variable and the second order moment matrix of (see (7)). Strong duality holds between problems (10) and (11), i.e. there is zero duality gap and their optimal values coincide.

The dual decision variables in (11) form a matrix of Lagrange multipliers for the constraints in (10) that is block decomposable as

 M=(M11m12mT12m22),

where are multipliers for the second moment constraint, multipliers for the mean value constraint, and a scalar multiplier for the constraint that should integrate to , i.e. that should be a probability measure.

For our particular problem, we have as defined in (6), so that (11) can be rewritten as

 sup ⟨Ω,M⟩−yd––– (12) s.t. [ξT1]M[ξT1]T≤ξ(i)∀ξ∈Rk [ξT1]M[ξT1]T≤yd–––,i=1,…,k

The infinite dimensional constraints in (12) can be replaced by the equivalent conic constraints

 M−[0ei/2eTi/20] ⪯0,i=1,…,k and M−[000Tyd–––] ⪯0,

respectively, where are the standard basis vectors in . Substituting the above constraints in (12) results in (), which proves the claim. ∎

Appendix B Gradient of the Expected Improvement Lower Bound

In this section we provide a proof of our second main result, Theorem 4, which shows that the gradient of our lower bound function (5) with respect to coincides with the optimal solution of the semidefinite program ().

Before proving Theorem 4 we require two ancillary results. The first of these results establishes that any feasible point for the optimization problem () has strictly negative definite principal minors in the upper left hand corner. {lemma} For any feasible of () the upper left matrix is negative definite.

Proof.

Let

 M=[M11m12mT12m22]

where and .

Using the definitions of for from (8), the semidefinite constraint in the optimization problem () can be expressed as

 xTM11x+2mT12xz+(m22−yd–––)z2 ≤0 (13) xTM11x+2(m12−ei2)Txz+m22z2 ≤0 (14) ∀x∈Rk,z∈R,i=1,…,k

We can easily see that by substituting in (14). It remains to show that this matrix is actually sign definite, which amounts to showing that it is full rank.

Assume the contrary, so that there exists some nonzero satisfying . Then, (14) gives . When for some , a sufficiently small can be chosen such that and the constraint (14) is violated.

On the other hand, when we can conclude that , which means that where and . In that case (13) is violated for a sufficiently small positive . Hence is full rank by contradiction and . ∎

Our second ancillary result considers the gradient of the function when its argument is varied linearly along some direction .

{lemma}

Given any and any moment matrix , define the scalar function as

 q(γ;Ω)\eqdefp(Ω+γ¯Ω).

Then is differentiable at with , where is the optimal solution of () at .

Proof.

Define the set as

 ΓΩ:=\setγγ∈\domainq(⋅;Ω)=\setγ(Ω+γ¯Ω)∈\domainp,

i.e. the set of all for which problem () has a bounded solution given the moment matrix . In order to prove the result it is then sufficient to show:

1. , and

2. The solution of () at is unique.

The remainder of the proof then follows from (Goldfarb and Scheinberg, 1999, Lemma 3.3), wherein it is shown that implies that is a subgradient of at , and subsequent remarks in Goldfarb and Scheinberg (1999) establish that uniqueness of the solution ensure differentiability.

We will show that both of the conditions (i) and (ii) above are satisfied. The proof relies on the Lagrange dual of problem (), which we can write as

 inf k∑i=0⟨Yi,Ci⟩−yd––– (D) s.t. Yi⪰0,i=0,…,k k∑i=0Yi=Ω.

(i): Proof that 0∈\interiorΓΩ:

It is well-known that if both of the primal and dual problems () and () are strictly feasible then their optimal values coincide, i.e. Slater’s condition holds and we obtain strong duality; see (Boyd and Vandenberghe, 2004, Section 5.2.3) and (Ramana et al., 1997).

For () it is obvious that one can construct a strictly feasible point. For (), defines a strictly feasible point for any . Hence () is solvable for any with sufficiently small. As a result, .

(ii): Proof that the solution to (P) at Ω is unique:

We can prove uniqueness by examining the Karush-Kuhn-Tucker conditions for a pair of primal and dual solutions , :

 Ci−¯M ⪰0 (15) ¯Yi ⪰0 (16) ⟨¯Yi,¯M−Ci⟩=0⇒¯Yi(¯M−Ci) =0 (17) ∂L(M,Ω)∂M∣∣∣¯M=0⇒k∑i=0¯Yi =Ω, (18)

where denotes the Lagrangian of (). As noted before, such a pair of solutions exists when . First, we will prove that and . Lemma B implies that (recall that is nonzero only in the last column or the last row), which means that . Due to the complementary slackness condition (17), the span of is orthogonal to the span of and consequently . However, according to (18) we have

 \rankk∑i=0¯Yi=\rank(Ω)\lx@stackrelΩ≻0⟹k∑i=0\rank(¯Yi)≥k+1 (19)

which results in

 \rank(¯M−Ci)=k,\rank(¯Yi)=1, (20) with R(¯Yi)=N(¯M−Ci),i=0,…,k (21)

where the final equality uses (17), and where and denote the kernel and image of a matrix, respectively.

Since for every the kernel is of dimension one, we can uniquely identify (up to a sign change) a vector in the kernel with unit norm. Moreover, according to (20) we can represent every as

 ¯Yi=λininTi,∀i=0,…,k. (22)

for some . The positivity of comes from the fact that is of rank one.

We next show that the set of null vectors are the same (up to a sign change) for every primal-dual solution. Assume that , with is also optimal and are the unit norm null vectors of . By definition we have

 ~nTi(δM+¯M−Ci)~ni=0i=0,…,k. (23)

However, since is feasible we have , which results in

 ~nTiδM~ni≥0,i=0,…,k. (24)

Since and have the same objective value we conclude that . Moreover, according to (18) and (22) we have the conditions for some . Hence

 \tr(ΩδM)=0⇒\tr(δMk∑i=0~λi~ni~nTi)=0

for some , which in turn implies that

 k∑i=0~λi\tr(δM~ni~nTi) =0 ⟹ k∑i=0~λi~nTiδM~ni =0 \lx@stackrel(???)⟹ ~λi~nTiδM~ni =0 ∀i=0,…,k \lx@stackrel(???)⟹ ~ni(¯M−Ci)~nTi =0 ∀i=0,…,k.

Since , this proves that is the uniquely defined unit norm null vector (up to a change in sign) of . This proves that .

Given the null vectors , the scalar values can be uniquely defined using (18), as they are the coefficients of the decomposition of the rank matrix to the rank matrices . Thus, given the uniqueness of across every solution , (22) proves uniqueness of , i.e. uniqueness of the dual solution. Now we can easily prove uniqueness for the primal solution. Summing (17) gives

 k∑i=0¯Yi(¯M−Ci)=0 \lx@stackrel(???)⇔Ω¯M=k∑i=0¯YiCi ⇔¯M=Ω−1k∑i=0¯YiCi.\qed

Proof of Theorem 4:

Given the preceding support results of this section, we are now in a position to prove Theorem 4.

We begin by considering the derivative of the solution of () when perturbing across a specific direction , i.e. with . Lemma B shows that when . The proof then follows element-wise from Lemma B by choosing a sparse matrix with the only nonzero element. ∎

Footnotes

1. It can be shown that the upper bound in (5) is trivial to evaluate and is not of practical use.

References

1. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
2. Clément Chevalier and David Ginsbourger. Fast Computation of the Multi-Points Expected Improvement with Applications in Batch Selection, pages 59–69. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013. ISBN 978-3-642-44973-4.
3. Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, 15:4053–4103, 2014.
4. Steven Diamond and Stephen Boyd. Cvxpy: A python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
5. R. Fletcher. Practical Methods of Optimization; (Second Edition). Wiley-Interscience, New York, NY, USA, 1987. ISBN 0-471-91547-5.
6. Roland W. Freund and Florian Jarre. A sensitivity result for semidefinite programs. Oper. Res. Lett., 32(2):126–132, March 2004. ISSN 0167-6377.
7. Alan Genz and Frank Bretz. Computation of Multivariate Normal and T Probabilities. Springer Publishing Company, Incorporated, 1st edition, 2009. ISBN 364201688X, 9783642016882.
8. David Ginsbourger, Rodolphe Le Riche, Laurent Carraro, and DÃ©partement Mi. A multi-points criterion for deterministic parallel global optimization based on gaussian processes. Journal of Global Optimization, in revision, 2009.
9. David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging Is Well-Suited to Parallelize Optimization, pages 131–162. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-10701-6.
10. D. Goldfarb and K. Scheinberg. On parametric semidefinite programming. Applied Numerical Mathematics, 29(3):361 – 377, 1999. ISSN 0168-9274.
11. Javier Gonzalez, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch bayesian optimization via local penalization. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 648–657, Cadiz, Spain, 09–11 May 2016. PMLR.
12. Javier GonzÃ¡lez, Michael A. Osborne, and Neil D. Lawrence. GLASSES : Relieving The Myopia Of Bayesian Optimisation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
13. GPyOpt. A bayesian optimization framework in python.
14. Donald R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001. ISSN 1573-2916.
15. Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, Dec 1998. ISSN 1573-2916.
16. Sébastien Marmin, Clément Chevalier, and David Ginsbourger. Differentiating the Multipoint Expected Improvement for Optimal Batch Design, pages 37–48. Springer International Publishing, Cham, 2015. ISBN 978-3-319-27926-8.
17. ApS MOSEK. The MOSEK optimization toolbox for Python manual.
18. B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd. SCS: Splitting conic solver, version 1.2.6. https://github.com/cvxgrp/scs, 2016a.
19. Brendan O’Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, feb 2016b.
20. Bart PG Van Parys, Paul J Goulart, and Manfred Morari. Distributionally robust expectation inequalities for structured distributions. Optimization-Online e-prints, 2015.
21. Motakuri V Ramana, Levent Tunçel, and Henry Wolkowicz. Strong duality for semidefinite programming. SIAM Journal on Optimization, 7(3):641–662, 1997.
22. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. ISBN 026218253X.
23. Olivier Roustant, David Ginsbourger, Yves Deville, et al. DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization. Journal of Statistical Software, 51(i01), 2012.
24. Amar Shah and Zoubin Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3330–3338. Curran Associates, Inc., 2015.
25. Amar Shah, Andrew Gordon Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to gaussian processes. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, pages 877–885, 2014.
26. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
27. J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11–12:625–653, 1999. Version 1.05 available from http://fewcal.kub.nl/sturm.
28. Matt Wytock, Steven Diamond, Felix Heide, and Stephen Boyd. A new architecture for optimization modeling frameworks. In Proceedings of the 6th Workshop on Python for High-Performance and Scientific Computing, PyHPC ’16, pages 36–44, Piscataway, NJ, USA, 2016. IEEE Press. ISBN 978-1-5090-5220-2. doi: 10.1109/PyHPC.2016.5.
29. Y. Zheng, G. Fantuzzi, A. Papachristodoulou, P. Goulart, and A. Wynn. Chordal decomposition in operator-splitting methods for sparse semidefinite programs. 2017.
30. Steve Zymler, Daniel Kuhn, and Berç Rustem. Distributionally robust joint chance constraints with second-order moment information. Mathematical Programming, 137(1):167–198, 2013. ISSN 1436-4646.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minumum 40 characters

1042

How to quickly get a good answer:
• Keep your question short and to the point
• Check for grammar or spelling errors.
• Phrase it like a question
Test
Test description