Kernels over Sets of Finite Sets using RKHS Embeddings, with Application to Bayesian (Combinatorial) Optimization
Abstract
We focus on kernel methods for setvalued inputs and their application to Bayesian set optimization, notably combinatorial optimization. We introduce a class of (strictly) positive definite kernels that relies on Reproducing Kernel Hilbert Space embeddings, and successfully generalizes “double sum” set kernels recently considered in Bayesian set optimization, which turn out to be unsuitable for combinatorial optimization. The proposed class of kernels, for which we provide theoretical guarantees, essentially consists in applying an outer kernel on top of the canonical distance induced by a double sum kernel. Proofs of theoretical results about considered kernels are complemented by a few practicalities regarding hyperparameter fitting. We furthermore demonstrate the applicability of our approach in prediction and optimization tasks, relying both on toy examples and on two test cases from mechanical engineering and hydrogeology, respectively. Experimental results illustrate the added value of the approach and open new perspectives in prediction and sequential design with set inputs.
1 Introduction
Kernel methods (Aronszajn, 1950; Kimeldorf and Wahba, 1970; Schölkopf and Smola, 2002; Saitoh and Sawano, 2016) constitute a very versatile framework for a variety of tasks in classification (Steinwart and Christmann, 2008), function approximation based on scattered data (Wendland, 2005), and probabilistic prediction (Rasmussen and Williams, 2006). One of the outstanding features of Gaussian Process (GP) prediction, in particular, is its usability to design Bayesian Optimization (BO) algorithms (Moćkus et al., 1978; Jones et al., 1998; Frazier, 2018) and further sequential design strategies (Risk and Ludkovski, 2018; Binois et al., 2019; Bect et al., 2019). While in most usual BO and related contributions the focus is on continuous problems with vectorvalued inputs, there has been a growing interest recently for GPrelated modelling and BO in the presence of discrete and mixed discretecontinuous inputs (Kondor and Lafferty, 2002; Gramacy and Taddy, 2010; Fortuin et al., 2018; Roustant et al., 2018; GarridoMerchan and HernándezLobato, 2018; Ru et al., 2019; Griffiths and HernándezLobato, 2019). Here we focus specifically on kernels dedicated to finite setvalued inputs and their application to GP modelling and BO, notably (but not only) in combinatorial optimization.
A number of prediction and optimization problems from various application domains involve finite setvalued inputs, encompassing for instance sensor network design (Garnett et al., 2010), simulationbased investigation of the mechanical behaviour of biphasic materials depending on the position of inclusion positions (Ginsbourger et al., 2016), inventory system optimization (Salemi et al., 2019), selection of starting centers in clustering algorithms (Kim et al., 2019), but also speaker recognition and image texture classification (as mentioned by Desobry et al. (2005)), natural language processing tasks with bags of words (Pappas and PopescuBelis, 2017), or optimal positioning of landmarks in shape analysis (Iwata, 2012), to cite a few. Yet, the number of available kernel methods for efficiently tackling such problems is still quite moderate, although the topic has gained interest among the machine learning and further research communities in the last few years. In particular, early investigations regarding the definition of positive definite kernels on finite sets encompass (Kondor and Jebara, 2003), (Grauman and Darrell, 2007), and also indirectly (Cuturi et al., 2005) where kernels between atomic measures are introduced that can also accomodate finite sets as a particular case (assuming a uniform measure, as implicitly done in the considered embedding approach). Kernels on finite sets that have been used in BO include to the best of our knowledge radial kernels with respect to the the earth mover’s distance (Garnett et al., 2010, where the question of their positive definiteness is not discussed), kernels on graphs implicitly defined via precision matrices in the context of Gaussian Markov Random Fields in (Salemi et al., 2019), and a class that we refer to as “double sum” kernels (traced back to Gätner et al. (2002)) in (Kim et al., 2019). From the side of combinatorial optimization, while an approach relying on Bayesian networks was considered already in (Larraiiaga et al., 2000), the topic has recently attracted attention in GPbased BO with respect to the set inputs (see for instance Baptista and Poloczek (2018) where the emphasis is not on the employed kernels, and Oh et al. (2019) where graph representations are used), and also in GPbased BO over the latent space of a variational autoencoder (Griffiths and HernándezLobato, 2019).
Our approach here is to leverage the fertile framework of Reproducing Kernel Hilbert Space Embeddings (Berlinet and ThomasAgnan, 2004; Smola et al., 2007; Sriperumbudur et al., 2011; Muandet et al., 2017) to build a novel class of kernels on finite sets by chaining some “outer” kernels with the canonical (pseudo)distances attached to the double sum kernels of (Gätner et al., 2002; Kim et al., 2019). We show that while restricting “inner” kernels to strictly positive definite ones does not lead to strictly positive definite double sum kernels, combining this assumption with another one guaranteeing strict positive definiteness in Hilbert space of the “outer” kernel (Bachoc et al., 2018) is sufficient for our proposed kernels to be strictly positive definite indeed, a crucial property in particular for combinatorial optimization. We present in turn a few additional results pertaining to the parametrization of this class of kernels and to the related hyperparameter fitting problem, including geometrical considerations around the choice of hyperparameter bounds in the embedding framework.
Section 2 is mainly dedicated to the construction and theoretical analysis of the considered classes of kernels, and complemented by practicalities regarding hyperparameter fitting. In Section 3, numerical experiments are discussed that compare double sum and proposed kernels in prediction and optimization tasks, both on analytical and on two application test cases, namely in mechanical engineering with plasticity simulations of a biphasic material tackled in (Ginsbourger et al., 2016), and in hydrogeology with an original monitoring well selection problem based on the contaminant source localization test case from (Pirot et al., 2019).
2 Set kernels via RKHS embeddings
2.1 Notation and Settings
We focus on positive definite kernels defined over subsets of some base set . Depending on the cases, may be finite or infinite. The considered set of subsets of , denoted , may be the whole power set or a subset thereof, e.g. (also traditionally noted in set theory) the set of subsets of consisting of elements (where is assumed smaller than or equal to the cardinality of ), or the set of all (nonvoid) finite subsets of denoted here . Given a positive definite kernel over and the associated Reproducing Kernel Hilbert Space , what we call the embedding of in is the mapping
(1) 
where stands for the cardinality of . Note that this “set embedding” coincides with the Kernel Mean Embedding (Muandet et al., 2017) in of the uniform probability distribution over .
2.2 From Double Sum to Proposed Kernels
A natural idea to create a positive definite kernel on from this embedding is to plainly take the RKHS scalar product between embedded sets:
(2) 
which is none other than the kernel used in (Kim et al., 2019) and that we refer to here as double sum kernel. As we will see in the next section and in the applications, this positive definite kernel may suffer in some settings from its lack of strict positive definiteness. Yet it appears as a crucial building block in the class of strictly positive definite kernels that we propose here. The first step is to consider the canonical (pseudo)distance on induced by this , namely
(3) 
Coming now to the proposed class of kernels per se, these are obtained by composing what can be called a radial kernel on Hilbert space (See Bachoc et al. (2018) for a reminder) with above. We hence obtain another class of kernels on by writing
(4) 
with being such that is positive definite for any Hilbert space . The fact that kernels generated in this way are positive definite on follows directly from the positive definiteness of on the Hilbert space and the representation of in terms of RKHS distance between the images of by some mapping (See (Berg et al., 1984; Christmann and Steinwart, 2010) for similar constructions). Furtheremore, as we develop next, we can ensure under some assumptions that such kernels will further be strictly positive definite on , a feature that will turn out to be crucial in combinatorial optimization.
2.3 Main Theoretical Results
Proposition 1 (Nonstrict positive definiteness of double sum kernels).
Let be a set, be a positive definite kernel on , and be the set of finite subsets of . Then the kernel over defined by Eq. 2 is positive definite. However, even if is strictly positive definite on , such a will generally not be strictly positive definite on unless is a singleton.
Proof of Prop. 1.
Let us consider here the case where the base set is finite with cardinality , so that . Since is finite, a positive definite kernel on boils down to a Gram matrix, say where . By reformulating Eq. 2 in terms of a bilinear form with respect to a specific vector function of and , we will not only revisit the proof that is indeed positive definite but also establish that it is generally not strictly positive definite. Let us define for that purpose
From there can be reformulated into
(5) 
so that, for any , , and ,
by positive semidefiniteness of , hence implying that is p.d. indeed. Yet, this representation will allow us to shed light on the fact that even if is a positive definite matrix (i.e., that is strictly p.d. on ), the matrix will actually be systematically singular for and even sometimes in cases where . Exploiting Eq. 5 and defining , we get in product form as follows
From there we get that with and so . Now, the rank of a matrix being invariant under pre (or post) multiplication by a non singular matrix, we get in the case of a nonsingular that . Hence for , and the matrix of interest is singular. To see that can be singular when even in the case of an invertible , one can think for instance of the situation where and . ∎
Proposition 2 (Distance over induced by RKHS embedding).
If is strictly positive definite over , then is injective and defined by Eq.2.2 defines a distance over .
Proof of Prop. 2.
straightforwardly inherits from the triangle inequality associated with ’s distance. The only point to check is the definiteness, i.e. that for . Yet, assuming with and for some and would imply that
After regrouping terms when necessary, one would arrive at an equality of the kind (with, for instance in the case , , and for while and for ), which would be in contradiction with ’s strict positive definiteness as soon as , proving in turn that is injective. ∎
Proposition 3 (Strict positive definiteness of ).
If is strictly positive definite over and is such that is strictly positive definite for any Hilbert space , then of Eq. 4 is strictly positive definite over .
Proof of Prop. 3.
is strictly positive definite on by assumption on and the fact that is a particular Hilbert space. The strict positive definiteness of on then follows from the injectivity of implied by the strict positive definiteness of (as established in Prop. 2). ∎
Remark 1.
As mentioned in Bachoc et al. (2018), continuous functions inducing strictly positive definite functions on any Hilbert space can be characterized following Schoenberg’s works both in terms of completely monotone functions and of infinite mixtures of squared exponential kernels (See, e.g., Wendland (2005)).
2.4 Practicalities
In what follows and as in many practical situations, we consider inner kernels of the form , where and is a (strictly) positive definite kernel on taking the value on the diagonal and parametrized by some (vectorvalued or other) hyperparameter . In such a case, denoting and the associated canonical (pseudo)distance, we immediately have that and . As a consequence, if writes for and defining a radial (strictly) positive definite kernel on Hilbert space (possibly depending on some other hyperparameters ignored for simplicity) with ,
and it clearly appears that having both and creates some overparametrization of . For this reason, we adopt the convention that , hence remaining with the hyperparameters , and to be fitted, possibly along with others such as trend and/or noise parameters. In our experiments, where noiseless settings and a constant trend are assumed, we appeal to Maximum Likelihood Estimation with concentration on the parameter and a genetic algorithm with derivatives (Mebane Jr et al., 2011), in the flavour of the solution implemented in the DiceKriging R package (Roustant et al., 2012).
In the numerical experiments presented next, the base set is assumed to be of the form (in our examples ), and we choose for an isotropic Gaussian correlation kernel with a unique range parameter denoted . As for , while any kernel admissible in Hilbert space such as those of the Matérn family would be suitable, we also choose here a Gaussian one for simplicity, and we hence end up with a triplet of covariance hyperparameters, namely . As is taken care of by concentration (i.e. its optimal value for any given value of can be analytically derived as a function of and ), there remains to maximize the corresponding concentrated (a.k.a. profile) loglikelihood function with respect to and . For this purpose the analytical gradient of the concentrated loglikelihood with respect to these parameters has been calculated and implemented. Besides, parameter bounds need to be specified to the chosen optimization algorithm (recall that genoud is used here), and while it seems natural to choose bounds in terms of , the diameter of the unit dimensional hypercube, for the adequate diameter is slightly less straightforward and calls for some analysis with respect to the range of variation of and how it depends on . The next proposition establishes simple yet practically quite useful results regarding the diameter of () with respect to and its maximal value when letting vary.
Proposition 4.
Let be an isotropic positive definite kernel on assumed to be monotonically decreasing to with respect to the Euclidean distance between elements of , and with a range parameter . Then the diameter of , i.e. , is reached with arguments and , where . Furthermore, the supremum of this diameter with respect to is given by .
Proof.
Let us consider two sets . Then, from the fact that a correlation kernel is upperbounded by , we get
3 Applications
We now demonstrate the applicability of the proposed class of kernels for both prediction and optimization purposes, with comparisons when applicable to similar methods based on double sum kernels, and also to random search in the optimization case. In all examples, both inner and outer kernels are assumed Gaussian. The three hyperparameters are estimated by Maximum Likelihood with concentration on , as detailed in Section 2.4. Three synthetic test functions and two application test cases are considered, respectively in mechanical engineering (CASTEM) and in hydrogeology (Contaminant source localization), all presented below. In the CASTEM case, the available data set consists of a given number () of simulation input/outputs, while in the other test cases one may boil down to a similar situation by studying finite sets of subsets. Yet, the hydrogeology test case is the only one where the points/elements of subsets are structurally restricted to remain in a finite , here a set of possible well locations, hence leading to a combinatorial optimization problem.
3.1 Presentation of Test Functions and Cases
3.1.1 Synthetic Functions
Our three synthetic test functions consist of extensions of the rescaled BraninHoo test function (See Picheny et al., 2013), denoted below by , for setvalued inputs. These extensions are based respectively on the maximum (MAX), minimum (MIN), and mean (MEAN) of values associated with each of evaluation points in , leading to
(6) 
(7) 
(8) 
where . Let us remark that by design, the of Eq. 8 is wellsuited to be approximated using the double sum kernel of Eq. 2. Indeed, if is assumed to be a draw of a GP with kernel , then is a draw of a GP with kernel , as numerical results of Sections 3.2 and 3.3 do reflect.
3.1.2 CASTEM Simulations
The CASTEM dataset, inherited from (Ginsbourger et al., 2016), was originally generated from mechanical simulations performed using the Cast3m code (Castem, 2016) to compute equivalent stress values on biphasic material subjected to uniaxial traction. The unitsquare represents a matrix material containing 10 circular inclusions with identical radius of . The dataset consists of 404 pointsets along with their corresponding stress levels. Fig. 1 illustrates two example input/response from this dataset. While the goal pursued in (Ginsbourger et al., 2016) was rather in uncertainty propagation, we consider this data set here also from an optimization perspective.
3.1.3 Selection of Monitoring Wells for Contaminant Source Localization
This test case relies on a benchmark generator of groundwater contaminant source localization problems from (Pirot et al., 2019). The original problems consisted in finding among given candidate source localizations () which globally minimizes some measures of misfit between “reference” (or “observed”) and “simulated” contaminant concentrations at fixed times and monitoring wells such as
(9) 
where is the reference concentration at well and time step , is the corresponding simulated concentration when the source of contaminant is at , and is a given subset from fixed monitoring wells.
Here, instead of fixing the subset of well locations and looking for the optimal , we consider instead the maps of score discrepancies as a function of . From there, the considered combinatorial optimization problem consists in minimizing
(10) 
over the set of subsets of wells from . In the numerical experiments, we fix , and hence the cardinality of the considered set of subsets is . To test the efficiency of our approach on this application, the two contaminant source locations (A and B) and two geological geometries of (Pirot et al., 2019) are considered, leading to four cases.
Since the base set is itself finite here, it follows from Prop. 1 that resulting double sum kernels are not strictly positive definite so that BO with those kernels fails after few iterations, as found in numerical experiments. Two subsets of five well locations are plotted in Fig. 2 along with contours of corresponding score discrepancy maps and values of objective function derived from them.
The first combination (left subfigure) better represents the misfit function overall with a lower value. In fact, this subset is indeed the optimal one, obtained by exhaustive search over all candidates. Our goal is precisely to locate these optimal well locations whose contributions minimize the spatial sum of score discrepancies without involving exhaustive enumeration. The reader is referred to (Pirot et al., 2019) for further details and visualization of the misfit objective function, location of the contaminant source, and the coordinates of well locations.
3.2 Prediction: Settings and Results
To assess the predictive ability of the considered GP models under the considered settings of data sets split into learning and test parts, we appeal to the socalled or “predictive coefficient” (Marrel et al., 2008),
(11) 
where is the number of test pointsets, and are the actual response and the mean values predicted by the GP model, respectively. is the mean of ’s. The closer to the value of , the more efficient the predictor is. In addition, we also look at visual diagnostics based on the comparison of standardized residuals (i.e. divided by GP prediction standard deviations) with the normal distribution, both in cross and external validation.
The total size of datasets used to assess prediction performances for the three synthetic test problems, CASTEM, and the contamination applications are 1000, 404, and 200, respectively. Each dataset is further partitioned into training and testing subdatasets with percentages (80:20), (50:50) and (20:80). Average values over 20 replications are provided in Table 1. We see that the proposed approach gives higher value of than that with double sum kernel on all problems except for the MEAN function. Moreover, tends to increase with the proportion of the full data set used for training, except in one case with CASTEM.
Finally, to highlight the predictive performance of our method, Fig. 3 shows the leaveoneout diagnostic (left panel) and the outofsample validation (right panel) for the source localization application (Source A, Geology 1). The results show relatively moderate departures from the normality assumptions.
GP (proposed )  GP (double sum )  
20:80  50:50  80:20  20:80  50:50  80:20  
MAX  0.6926  0.8001  0.8525  0.6189  0.7429  0.7725 
MIN  0.3309  0.4582  0.4929  0.1406  0.2163  0.2538 
MEAN  0.9996  0.9999  1  1  1  1 
CASTEM  0.5806  0.6641  0.6543  0.5067  0.5326  0.5107 
Cont (Src A,Geo 1)  0.7616  0.8790  0.9115  n.a.  n.a.  n.a. 
Cont (Src A,Geo 2)  0.7228  0.8569  0.9048  n.a.  n.a.  n.a. 
Cont (Src B,Geo 1)  0.7937  0.9029  0.9309  n.a.  n.a.  n.a. 
Cont (Src B,Geo 2)  0.7958  0.8755  0.8968  n.a.  n.a.  n.a. 
3.3 Optimization: Settings and Results
In this section, the efficiency of proposed kernels against double sum kernels are evaluated within the BO framework, using the Expected Improvement (EI) (Moćkus et al., 1978) as infill sampling criterion.
Optimization performances are assessed on 50 repetitions of EI algorithms with initial design pointsets. For each repetition, all algorithms start with the same initial design, and are allocated additional objective function evaluations. The hyperparameters are iteratively redetermined in every iteration using MLE. Concerning EI maximization, in all three synthetic problems and in the CASTEM case, as the problem sizes are relatively small, it is feasible to compute EI values at all pointsets and select the one attaining the highest value. However, for our contaminant source application, since the problem size is , EI maximization is surrogated by taking the best among 500 generated candidate pointsets using 2 strategies in the flavour of (Garnett et al., 2010). The first one focuses on exploitation by considering candidate pointsets departing by only one element from the current best subset. The second one promotes exploration by randomly generating candidate subsets.
The performance is measured by (1) counting the number of trials (out of 50) for which the algorithm could find the best point from the considered dataset and (2) monitoring the distribution (or median/selected quantiles) of best found responses over iterations. A random sampling method is used as baseline. Table 2 summarizes the number of trials that the minimum is found and Fig. 4 represents progress curves in terms of the median value of current best values over 50 trials along with the 25th and 75th percentiles.
Problems 



EI (proposed )  EI (double sum )  RANDOM  
MAX  38  8  3  
MIN  10  9  3  
MEAN  50  50  2  
CASTEM  33  10  6  
Cont (Src A,Geo 1)  43  n.a.  0  
Cont (Src A,Geo 2)  27  n.a.  0  
Cont (Src B,Geo 1)  39  n.a.  0  
Cont (Src B,Geo 2)  29  n.a.  0 
It is clear that EI algorithms with any of the two considered kernels are superior to random sampling. Experiments on synthetic problems show that within the two considered EI algorithm settings, the proposed kernels outperform the double sum ones on the MAX and MIN problems in terms of the number of trials that the minimum is found. On the MEAN problem, though, while both methods could locate the minimum for all 50 replications, using a double sum kernel used fewer number of iterations as can be seen in Fig. 4.
For the CASTEM dataset, EI algorithms with the proposed versus double sum kernels could locate the minimum for 33 and 10 trials, respectively (against for random sampling) confirming the superior performance of the proposed method.
Again due to lack of strict positive definiteness, the double sum kernel is not applicable for the contaminant source applications. In this application, the EI algorithm coupled with proposed kernel is by far better than the random sampling on all four considered cases, being able to locate within 40 iterations the global optimum of the considered combinatorial optimization problem respectively in , , , and out of trials. These results certainly illustrate the potential of our proposed class of kernels to efficiently address expensive combinatorial optimization problems in a Bayesian Optimization framework.
4 Discussion
Experimental results obtained on the analytical objective functions and application test cases clearly confirm the added value of the proposed approach for setfunction prediction and (combinatorial) optimization.
Yet a number of challenges and potential extensions remain to be addressed in future work. This includes computational difficulties that will arise when working with larger numbers of subsets and/or subset cardinalities, and this not only to handle bigger matrices but also to tackle the optimization of infill criteria.
From the test case perspective, future work may also include tackling further prediction and subset selection problems (be it in continuous or combinatorial settings), not only for optimization purposes but also with more general goals around uncertainty quantification and reduction. Besides this, a nice feature of the propose approach is that it would naturally extend to cases with varying subset cardinalities and also with “marked” point sets (in the vein of (Cuturi et al., 2005)’s molecular measures), hence accomodating applications such as CASTEM but with varying inclusion numbers and radii. Furthermore, the conceptual approach of chaining an embedding and a kernel in Hilbert space (also in the flavour of (Christmann and Steinwart, 2010)) could apply to a variety of other input types, opening the door in turn to numerous nonconventional extensions of BO and related algorithms.
Acknowledgements
P.B. would like to thank DPST scholarship project granted by IPST, Ministry of Education, Thailand for providing financial support during his master study. D.G.’s contributions have taken place within the Swiss National Science Foundation project number 178858. Furthermore, D.G. would like to thank several colleagues including notably Fabrice Gamboa, Athénaïs Gautier, Luc Pronzato, and Henry Wynn for enriching discussions in recent years around ideas presented in this paper. T.K. would like to acknowledge the support of Thailand Research Fund under Grant No.: MRG6080208, Center of Excellence in Mathematics, CHE, Thailand, and the Faculty of Science, Mahidol University. The authors would like to acknowledge the support of Idiap Research Institute. In particular, most numerical experiments presented here where run on Idiap’s grid. The authors also thank Drs. Jean Baccou and Frédéric Perales (Institut de Radioprotection et de Sûreté Nucléaire, SaintPaullèsDurance, France) for the CASTEM data, and Dr. Clément Chevalier who has been involved in investigations on this data in the framework of the ReDICE consortium.
References
 Aronszajn (1950) Aronszajn, N. (1950). Theory of reproducing kernels. Transaction of the American Mathematical Society, 68 (3):337 – 404.
 Bachoc et al. (2018) Bachoc, F., Suvorikova, A., Ginsbourger, D., Loubes, J.M., and Spokoiny, V. (2018). Gaussian processes with multidimensional distribution inputs via optimal transport and hilbertian embedding. arXiv preprint arXiv:1805.00753.
 Baptista and Poloczek (2018) Baptista, R. and Poloczek, M. (2018). Bayesian optimization of combinatorial structures. In Proceedings of the 35th International Conference on Machine Learning Learning,.
 Bect et al. (2019) Bect, J., Bachoc, F., and Ginsbourger, D. (2019). A supermartingale approach to Gaussian process based sequential design of experiments. Bernoulli, 25(4A):2883–2919.
 Berg et al. (1984) Berg, C., Christensen, J., and Ressel, P. (1984). Harmonic Analysis on Semigroups. Springer.
 Berlinet and ThomasAgnan (2004) Berlinet, A. and ThomasAgnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers.
 Binois et al. (2019) Binois, M., Huang, J., Gramacy, R., and Ludkovski, M. (2019). Replication or exploration? Sequential design for stochastic simulation experiments. Technometrics, 61(1):7–23.
 Castem (2016) Castem (2016). Cast3m software, http://wwwcast3m.cea.fr.
 Christmann and Steinwart (2010) Christmann, A. and Steinwart, I. (2010). Universal kernels on nonstandard input spaces. In Advances in neural information processing systems, pages 406–414.
 Cuturi et al. (2005) Cuturi, M., Fukumizu, K., and Vert, J. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198.
 Desobry et al. (2005) Desobry, F., Davy, M., and Fitzgerald, W. (2005). A class of kernels for sets of vectors. In In Proceedings of the 13th European Symposium on Artificial Neural Networks.
 Fortuin et al. (2018) Fortuin, V., Dresdner, G. Strathmann, H., and Rätsch, G. (2018). Scalable gaussian processes on discrete domains. arXiv:1810.10368.
 Frazier (2018) Frazier, P. (2018). A tutorial on bayesian optimization. arXiv:1807.02811.
 Garnett et al. (2010) Garnett, R., Osborne, M. A., and Roberts, S. J. (2010). Bayesian optimization for sensor set selection. In Proceedings of the 9th ACM/IEEE international conference on information processing in sensor networks, pages 209–219. ACM.
 GarridoMerchan and HernándezLobato (2018) GarridoMerchan, E. and HernándezLobato, D. (2018). Dealing with categorical and integervalued variables in bayesian optimization with gaussian processes. arXiv:1805.03463.
 Gätner et al. (2002) Gätner, T., Flach, P. A., Kowalczyk, A., and Smola, A. J. (2002). Multiinstance kernels. In Proceedings of the International Conference on Machine Learning.
 Ginsbourger et al. (2016) Ginsbourger, D., Baccou, J., Chevalier, C., and Perales, F. (2016). Design of computer experiments using competing distances between setvalued inputs. In mODa 11Advances in ModelOriented Design and Analysis, pages 123–131. Springer.
 Gramacy and Taddy (2010) Gramacy, R. B. and Taddy, M. A. (2010). Categorical inputs, sensitivity analysis, optimization and importance tempering with tgp version 2, an r package for treed gaussian process models. Journal of Statistical Software, 33(6).
 Grauman and Darrell (2007) Grauman, K. and Darrell, T. (2007). The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research, 8:725–760.
 Griffiths and HernándezLobato (2019) Griffiths, R.R. and HernándezLobato, J. M. (2019). Constrained bayesian optimization for automatic chemical design. arXiv:1709.05501.
 Iwata (2012) Iwata, K. (2012). Placing landmarks suitably for shape analysis by optimization. In 21st International Conference on Pattern Recognition.
 Jones et al. (1998) Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive blackbox functions. Journal of Global optimization, 13(4):455–492.
 Kim et al. (2019) Kim, J., McCourt, M., You, T., Kim, S., and Choi, S. (2019). Bayesian optimization over sets. In 6th ICML Workshop on Automated Machine Learning. arXiv:1905.09780.
 Kimeldorf and Wahba (1970) Kimeldorf, G. S. and Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist., 41(2):495–502.
 Kondor and Jebara (2003) Kondor, R. and Jebara, T. (2003). A kernel between sets of vectors. In Proceedings of the Twentieth International Conference on Machine Learning.
 Kondor and Lafferty (2002) Kondor, R. and Lafferty, J. (2002). Diffusion kernels on graphs and other discrete structures. In Proceedings of the 19th International Conference on Machine Learning, page 315–322.
 Larraiiaga et al. (2000) Larraiiaga, P., Etxeberria, R., Lozano, J., and Peiia, J. (2000). Combinatorial optimization by learning and simulation of bayesian networks. In Uncertainty in Artificial Intelligence Proceedings.
 Marrel et al. (2008) Marrel, A., Iooss, B., van Dorpe, F., and Volkova, E. (2008). An efficient methodology for modeling complex computer codes with gaussian processes. Computational Statistics and Data Analysis.
 Mebane Jr et al. (2011) Mebane Jr, W. R., Sekhon, J. S., et al. (2011). Genetic optimization using derivatives: the rgenoud package for r. Journal of Statistical Software, 42(11):1–26.
 Moćkus et al. (1978) Moćkus, J., Tiesis, V., and Źilinskas, A. (1978). The application of bayesian methods for seeking the extremum. vol. 2.
 Muandet et al. (2017) Muandet, K., Fukumizu, K., and B., S. (2017). Kernel mean embedding of distributions : A review and beyond. Foundations and Trends in Machine Learning, 10(12):1–141.
 Oh et al. (2019) Oh, C., Tomczak, J., Gavves, E., and Welling, M. (2019). Combo: Combinatorial bayesian optimization using graph representations. In ICML Workshop on Learning and Reasoning with GraphStructured Data.
 Pappas and PopescuBelis (2017) Pappas, N. and PopescuBelis, A. (2017). Explicit document modeling through weighted multipleinstance learning. Journal of Artificial Intelligence Research, 58.
 Picheny et al. (2013) Picheny, V., Wagner, T., and Ginsbourger, D. (2013). A benchmark of krigingbased infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, 48(3):607–626.
 Pirot et al. (2019) Pirot, G., Krityakierne, T., Ginsbourger, D., and Renard, P. (2019). Contaminant source localization via bayesian global optimization. Hydrology and Earth System Sciences, 23(1):351–369.
 Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. (2006). Gaussian process for machine learning. MIT press.
 Risk and Ludkovski (2018) Risk, J. and Ludkovski, M. (2018). Sequential design and spatial modeling for portfolio tail risk measurement. SIAM Journal on Financial Mathematics, 9(4):1137–1174.
 Roustant et al. (2012) Roustant, O., Ginsbourger, D., and Deville, Y. (2012). Dicekriging, diceoptim: Two r packages for the analysis of computer experiments by krigingbased metamodeling and optimization.
 Roustant et al. (2018) Roustant, O., Padonou, E., Deville, Y., Clémet, A., Perrin, G., Giorla, J., and Wynn, H. (2018). Group kernels for gaussian process metamodels with categorical inputs. arXiv:1802.02368.
 Ru et al. (2019) Ru, B., Alvi, A., Nguyen, V., Osborne, M. A., and Roberts, S. (2019). Bayesian optimisation over multiple continuous and categorical inputs.
 Saitoh and Sawano (2016) Saitoh, S. and Sawano, Y. (2016). Theory of Reproducing Kernels and Applications. Springer.
 Salemi et al. (2019) Salemi, P. L., Song, E., Nelson, B., and Staum, J. (2019). Gaussian markov random fields for discrete optimization via simulation: Framework and algorithms. Operations Research, 67:250–266.
 Schölkopf and Smola (2002) Schölkopf, B. and Smola, A. (2002). Learning with kernels. MIT Press.
 Smola et al. (2007) Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007). A hilbert space embedding for distributions. In Algorithmic Learning Theory: 18th International Conference, page 13–31. Springer.
 Sriperumbudur et al. (2011) Sriperumbudur, B., Fukumizu, K., and Lanckriet, G. (2011). Universality, characteristic kernels and rkhs embedding of measures. Journal of Machine Learning Research, (12):2389–2410.
 Steinwart and Christmann (2008) Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer.
 Wendland (2005) Wendland, H. (2005). Scattered Data Approximation. Cambridge University Press.