Bayesian Optimization over Sets
Abstract
We propose a Bayesian optimization method over sets, to minimize a black-box function that takes a set as its single input. Because set inputs are permutation-invariant and variable-length, traditional Gaussian process-based Bayesian optimization strategies which assume vector inputs can fall short. To address this, we develop a Bayesian optimization method with a set kernel that is used to build surrogate functions. This kernel accumulates similarity over set elements to enforce permutation invariance and permit sets of variable size, but this comes at a greater computational cost. To reduce this burden, we propose a more efficient probabilistic approximation which we prove is still positive-definite and is an unbiased estimator of the true set kernel. Finally, we present several numerical experiments which demonstrate that our method outperforms other methods in various applications.
Jungtaek Kim POSTECH Pohang, Republic of Korea jtkim@postech.ac.kr Michael McCourt SigOpt San Francisco, USA mccourt@sigopt.com Tackgeun You POSTECH Pohang, Republic of Korea tackgeun.you@postech.ac.kr Saehoon Kim AITRICS Seoul, Republic of Korea shkim@aitrics.com Seungjin Choi POSTECH Pohang, Republic of Korea seungjin@postech.ac.kr
Preprint. Under review.
1 Introduction
Bayesian optimization (BO) is an effective method to optimize a black-box function which is expensive to evaluate. It has proven useful in several applications, including hyperparameter optimization [27, 17], neural architecture search [29, 18], material design [8], and synthetic gene design [13]. Classic BO assumes that a search region $\mathcal{X} \subset \mathbb{R}^d$ is defined and that the black-box function $f$ can only produce scalar output $y$ in the presence of additive noise $\epsilon$, i.e., $y = f(\mathbf{x}) + \epsilon$ for $\mathbf{x} \in \mathcal{X}$.
Unlike this standard BO formulation, in this article we assume that our search region is $\mathfrak{X} = \{X \subset \mathbb{R}^d : |X| = m\}$ for a fixed positive integer $m$. Thus, for $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \in \mathfrak{X}$, $f$ would take in a set containing $m$ elements, each of length $d$, and return a noisy function value $y$:
$y = f(\{\mathbf{x}_1, \ldots, \mathbf{x}_m\}) + \epsilon.$ (1)
Our motivating example comes from the soft $k$-means clustering algorithm over a dataset $\mathcal{P}$; in particular, we want to find the optimal initialization of such an algorithm. The objective function for this problem is a squared loss function which takes in the $k$ cluster initialization points and returns the weighted distance between the points in $\mathcal{P}$ and the converged cluster centers. See [19] for more details.
Some previous research has tried to build Gaussian process (GP) models on set data. [10] proposes a method over discrete sets using stationary kernels over the first Wasserstein distance between two sets, though the power set of fixed discrete sets as a domain is not our interest. Moreover, this method must pay the cost of a Wasserstein distance computation, which is cubic in the set cardinality, for every entry of the covariance matrix over sets. In addition, because it only considers stationary kernels, the GP regression is restricted to a form that cannot express nonstationary models [22]. More detailed explanations can be found in Section 2.3.
Therefore, we instead adapt and augment a strategy proposed by [11] involving the creation of a specific set kernel. This set kernel uses a kernel defined on the elements of the sets to build up its own sense of covariance between sets. In turn, it can be used directly to build surrogate functions through GP regression, which can then power BO, by Lemma 1.
A key contribution of this article is the development of a computationally efficient approximation to this set kernel. Given $n$ total observed function values, the cost of constructing the matrix required for fitting the GP is $\mathcal{O}(n^2 m^2 d)$, which dominates the subsequent $\mathcal{O}(n^3)$ matrix inversion when, approximately, $m^2 d > n$ (see the complexity analysis in Section 3.3). We propose the use of random subsampling to reduce the computational cost to $\mathcal{O}(n^2 L^2 d)$ for $L < m$ while still producing an unbiased estimator of the true set kernel.
We summarize our contributions as follows.

We propose Bayesian optimization over sets, using a set kernel to define covariance for our Gaussian process-based surrogate function.

We produce a novel estimator of the set kernel, prove key properties about the bias and variance of this estimator, and analyze its superior computational efficiency relative to the true set kernel.

We empirically show that our estimator retains the theoretical properties enjoyed by the original kernel that are necessary to perform effectively in our applications.
2 Background
In this section, we introduce the ideas and notation necessary to understand BO, as well as previous research on kernels for sets.
2.1 Bayesian Optimization
BO seeks to minimize an unknown function $f$ which is expensive to evaluate:
$\mathbf{x}^\star = \operatorname*{arg\,min}_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}),$ (2)
where $\mathcal{X} \subset \mathbb{R}^d$ is a compact space. It is a sequential optimization strategy which, at each iteration, performs the following three computations:

Using the $t$ pieces of data presently available, $\{(\mathbf{x}_i, y_i)\}_{i=1}^{t}$, build a probabilistic surrogate model meant to approximate $f$.

Using the surrogate model, compute an acquisition function $a(\mathbf{x})$, which represents the utility of next acquiring data at some new point $\mathbf{x}$.

Observe $y_{t+1}$ from the true function at the location $\mathbf{x}_{t+1} = \operatorname*{arg\,max}_{\mathbf{x} \in \mathcal{X}} a(\mathbf{x})$.
After exhausting a predefined observation budget $T$, BO returns the best point, i.e., the one with the minimum observation. The benefit of this process is that the optimization of the expensive function $f$ has been replaced by the optimization of much cheaper and better-understood acquisition functions $a(\mathbf{x})$.
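As a concrete illustration, the three steps above can be sketched as a generic loop. The surrogate, candidate sampler, and acquisition function here are placeholder callables supplied by the user; this is a sketch of the generic procedure, not the authors' implementation:

```python
import numpy as np

def bayesian_optimization(f, sample_candidates, fit_surrogate, acquisition,
                          budget, init_x, init_y):
    """Generic BO loop: fit a surrogate, maximize an acquisition, observe f."""
    X, y = list(init_x), list(init_y)
    for _ in range(budget):
        model = fit_surrogate(np.asarray(X), np.asarray(y))   # step 1: surrogate
        candidates = sample_candidates()                      # pool for acquisition search
        scores = [acquisition(model, c) for c in candidates]  # step 2: utility a(x)
        x_next = candidates[int(np.argmax(scores))]
        X.append(x_next)
        y.append(f(x_next))                                   # step 3: observe f
    return X[int(np.argmin(y))]                               # best (minimum) observation
```

In practice the surrogate would be a GP and the acquisition something like expected improvement; any choices plugged into the three slots yield the same overall control flow.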
2.2 Set Kernel
We start by introducing notation which is required for performing kernel approximation of functions on set data. A set of $m$ vectors in $\mathbb{R}^d$ is denoted $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\}$. In a collection of such sets (as will occur in the BO setting), the $i$th set would be denoted $X^{(i)} = \{\mathbf{x}_1^{(i)}, \ldots, \mathbf{x}_m^{(i)}\}$. Note that we are restricting all sets to be of the same size here; in principle, sets of varying size can be considered, but we make this restriction to simplify our analysis.
To build a GP surrogate, we require a prior belief of the covariance between elements in $\mathfrak{X}$. This belief is imposed in the form of a positive-definite covariance kernel $k_{\mathrm{set}} : \mathfrak{X} \times \mathfrak{X} \to \mathbb{R}$; see [25, 5] for more discussion on approximation with kernels. In addition to the symmetry required in standard kernel settings, kernels on sets require an additional property: the ordering of elements in $X$ should be immaterial (since sets have no inherent ordering).
Given an empirical approximation $\mu_X = \frac{1}{m} \sum_{i=1}^{m} \phi(\mathbf{x}_i)$ of the kernel mean, where $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$ is a basis function and $d'$ is the dimension of the space projected by $\phi$, a set kernel [11, 21] is defined as
$k_{\mathrm{set}}(X^{(1)}, X^{(2)}) = \langle \mu_{X^{(1)}}, \mu_{X^{(2)}} \rangle = \frac{1}{m^2} \sum_{a=1}^{m} \sum_{b=1}^{m} k(\mathbf{x}_a^{(1)}, \mathbf{x}_b^{(2)}),$ (3)
since $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle = k(\mathbf{x}, \mathbf{x}')$. Here, $k$ is a positive-definite kernel defined to measure the covariance between the $d$-dimensional elements of the sets (e.g., a squared exponential or Matérn kernel).
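Under the same-size restriction, (3) is simply the mean of the inner kernel over all element pairs. A minimal sketch, assuming a Matérn 5/2 inner kernel with unit lengthscale:

```python
import numpy as np

def matern52(x, y, lengthscale=1.0):
    """Matern 5/2 inner kernel k between two d-dimensional vectors."""
    r = np.sqrt(5.0) * np.linalg.norm(np.asarray(x) - np.asarray(y)) / lengthscale
    return (1.0 + r + r ** 2 / 3.0) * np.exp(-r)

def set_kernel(X1, X2, inner=matern52):
    """Set kernel (3): average of the inner kernel over all m^2 element pairs.
    X1, X2 are (m, d) arrays; the value is invariant to row permutations."""
    return float(np.mean([[inner(a, b) for b in X2] for a in X1]))
```

Because the double sum runs over every pair of elements, reordering the rows of either input leaves the value unchanged, and each kernel entry costs $\mathcal{O}(m^2 d)$.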
2.3 Related Works
Although they have arisen from different interests, meta-learning approaches that deal with set inputs are promising in the machine learning community, because they can generalize across distinct tasks with meta-learners [4, 28, 6, 9]. In particular, [4, 28, 9] propose feed-forward neural networks which take permutation-invariant and variable-length inputs: their goal is to obtain features derived from the sets with which to feed a standard (meta-)learning routine. Because they consider the modeling of set data, these works are related to ours, but they focus on their own specific applications such as point cloud classification, few-shot learning, and image completion.
In BO, [10] suggests a method to find a set that produces a global minimum with respect to discrete sets, each of which is an element of the power set of an entire set. Because computing the first Wasserstein distance between two sets scales cubically in the set cardinality, they assume a small cardinality of sets and a discrete search space for the global optimization method. Furthermore, their method restricts the number of iterations for optimizing the acquisition function, since by the curse of dimensionality the number of iterations would otherwise have to increase exponentially. This implies that the global solution of the acquisition function is hard to find.
Compared to [10], we consider a continuous domain, which implies that an acquired set can be composed of any instances in a compact space $\mathcal{X}$. We can thus freely use an off-the-shelf global optimization method or a multi-started local optimization method [26] with a relatively large number of instances in the sets. In addition, the kernel in [10] has the structure $\kappa(d_W(X^{(1)}, X^{(2)}))$, where $\kappa$ is a stationary kernel [12] and $d_W$ is some distance function over two sets (in [10], the first Wasserstein distance). Using the method proposed in the subsequent section, nonstationary kernels can also be considered in modeling a surrogate function.
3 Proposed Method
We first propose and analyze an approximation to the set kernel (3) for GP regression in this section. Then, we present our BO method over sets.
In order for (3) to be a viable covariance kernel for GP regression, it must be positive-definite. To discuss this topic, we denote a list of $n$ sets with the notation $\mathbb{X} = [X^{(1)}, \ldots, X^{(n)}]$; in this notation, the order of the entries matters.
Lemma 1.
Suppose we have a list $\mathbb{X}$ which contains $n$ distinct sets $X^{(i)}$ for $i \in \{1, \ldots, n\}$. We define the matrix $\mathsf{K} \in \mathbb{R}^{n \times n}$ as
$[\mathsf{K}]_{i,j} = k_{\mathrm{set}}(X^{(i)}, X^{(j)}),$ (4)
for $k_{\mathrm{set}}$ defined with a chosen inner kernel $k$ as in (3). Then, $\mathsf{K}$ is a symmetric positive-semidefinite matrix if $k$ is a symmetric positive-definite kernel.
3.1 Approximation of the Set Kernel
Computing (4) requires pairwise comparisons between all sets present in $\mathbb{X}$, which has computational complexity $\mathcal{O}(n^2 m^2 d)$. To alleviate this cost, we propose to approximate (3) with
$\tilde{k}_{\mathrm{set}}(X^{(1)}, X^{(2)}; \mathbf{w}, \pi, L) = k_{\mathrm{set}}(\tilde{X}^{(1)}, \tilde{X}^{(2)}),$ (5)
where $\mathbf{w} \in \mathbb{R}^d$, $\pi$ is a permutation of $\{1, \ldots, m\}$, $L < m$, and $\tilde{X} \subset X$ with $|\tilde{X}| = L$ is a subset of $X$ which is defined by those three quantities (we omit explicitly listing them in $\tilde{X}$ to ease the notation).
The goal of the approximation strategy is to convert from $X$ (of size $m$) to $\tilde{X}$ (of size $L$) in a consistent fashion during all the computations comprising $\mathsf{K}$. We accomplish this in two steps:

Use a randomly generated vector $\mathbf{w} \in \mathbb{R}^d$ to impose an (arbitrary) ordering of the elements of all sets $X^{(i)}$, and

Randomly permute the indices $\{1, \ldots, m\}$ via a function $\pi$.
These random strategies are defined once before computing the matrix, and then used consistently throughout the entire computation.
To impose an ordering of the elements, we use a random scalar projection defined by a vector $\mathbf{w}$ whose elements are drawn from the standard normal distribution. If the scalar projections $\mathbf{w}^\top \mathbf{x}_i$ of each element are computed, this produces the set of scalar values $\{\mathbf{w}^\top \mathbf{x}_1, \ldots, \mathbf{w}^\top \mathbf{x}_m\}$, which can be sorted to generate an ordered list satisfying
$\mathbf{w}^\top \mathbf{x}_{(1)} \leq \mathbf{w}^\top \mathbf{x}_{(2)} \leq \cdots \leq \mathbf{w}^\top \mathbf{x}_{(m)},$ (6)
for an ordering $(1), \ldots, (m)$ of the distinct indices $1, \ldots, m$. Ties between values can be dealt with arbitrarily.
The function $\pi : \{1, \ldots, m\} \to \{1, \ldots, m\}$ is then simply a random bijection of the integers onto themselves. Using this, we can sample $L$ vectors from $X$:
$\tilde{X} = \{\mathbf{x}_{(\pi(1))}, \ldots, \mathbf{x}_{(\pi(L))}\}.$ (7)
This process, given $\mathbf{w}$, $\pi$ and $L$, is sufficient for computing (5).
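The two-step subset selection can be sketched as follows, under our reading of Section 3.1; `w` and `pi` are drawn once and then reused for every set, so that the selection is consistent across the whole covariance matrix:

```python
import numpy as np

def subsample_set(X, w, pi, L):
    """Select an L-element subset of X consistently across sets:
    (i) order the rows of X by the scalar projection w^T x, as in (6);
    (ii) keep the rows at positions pi[0], ..., pi[L-1] of that ordering, as in (7)."""
    order = np.argsort(X @ w)   # ordering induced by the shared random projection
    return X[order[pi[:L]]]

rng = np.random.default_rng(0)
m, d, L = 10, 3, 4
w = rng.standard_normal(d)      # random projection vector, drawn once
pi = rng.permutation(m)         # random bijection on {0, ..., m-1}, drawn once
```

Because the selection depends only on `w` and `pi` (and not on how the rows happen to be stored), any permutation of the rows of `X` yields the same subset, so the approximation inherits the permutation invariance of (3).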
3.2 Properties of the Approximation
The covariance matrix for this approximation of the set kernel, which we denote by $\tilde{\mathsf{K}}$, should approximate the full version of the covariance matrix $\mathsf{K}$ from (4). Because of the random structure introduced in Section 3.1, the matrix $\tilde{\mathsf{K}}$ will be random. This will be addressed in Theorem 1, but for now, $\tilde{\mathsf{K}}$ represents a single realization of that random variable, not the random variable itself. To be viable, this approximation must satisfy the following requirements:
Property 1.
The approximation satisfies pairwise symmetry:
$\tilde{k}_{\mathrm{set}}(X^{(1)}, X^{(2)}; \mathbf{w}, \pi, L) = \tilde{k}_{\mathrm{set}}(X^{(2)}, X^{(1)}; \mathbf{w}, \pi, L).$ (8)
Because $\tilde{X}$ is uniquely defined given $X$ (together with $\mathbf{w}$, $\pi$ and $L$), this simplifies to $k_{\mathrm{set}}(\tilde{X}^{(1)}, \tilde{X}^{(2)}) = k_{\mathrm{set}}(\tilde{X}^{(2)}, \tilde{X}^{(1)})$, which is true because $k_{\mathrm{set}}$ is symmetric.
Property 2.
The “ordering” of the elements in the sets should not matter when computing $\tilde{k}_{\mathrm{set}}$. Indeed, because (6) enforces an ordering based on $\mathbf{w}$, and not on whatever arbitrary indexing is imposed in defining the elements of the set, the kernel will be permutation-invariant.
Property 3.
The kernel approximation (5) reduces to computing $k_{\mathrm{set}}$ on a lower-cardinality version of the data (with $L$ elements selected from the original $m$). Because $k_{\mathrm{set}}$ is positive-definite on these $L$-element sets, we know that $\tilde{k}_{\mathrm{set}}$ is also positive-definite.
Property 4.
Since the approximation method only ever evaluates $k$ on subsets of the input sets, its computational cost is lower than that of the original formulation.
Missing from these four properties is a statement regarding the quality of the approximation. We address this in Theorem 1, though we first start by stating Lemma 2. Moreover, using Theorem 1, we can also obtain the variance of our estimate, as shown in Theorem 2.
Lemma 2.
Suppose there are two sets $X, Y \in \mathfrak{X}$. Without loss of generality, let $\tilde{X}_i$ and $\tilde{Y}_j$ denote the $i$th and $j$th of the $M = \binom{m}{L}$ possible subsets containing $L$ elements of $X$ and $Y$, respectively, in an arbitrary ordering. For $L \leq m$,
$\sum_{i=1}^{M} \sum_{j=1}^{M} k_{\mathrm{set}}(\tilde{X}_i, \tilde{Y}_j) = \frac{M^2}{m^2} \sum_{a=1}^{m} \sum_{b=1}^{m} k(\mathbf{x}_a, \mathbf{y}_b),$ (9)
where $\mathbf{x}_a$ and $\mathbf{y}_b$ are the $a$th and $b$th elements of $X$ and $Y$, respectively, in an arbitrary ordering.
Theorem 1.
Suppose that we are given two sets $X, Y \in \mathfrak{X}$. Suppose, furthermore, that $\mathbf{w}$ and $\pi$ are generated randomly as defined in Section 3.1 to form subsets $\tilde{X}$ and $\tilde{Y}$. Then the value of $\tilde{k}_{\mathrm{set}}(X, Y; \mathbf{w}, \pi, L)$ is an unbiased estimator of the value of $k_{\mathrm{set}}(X, Y)$.
Theorem 2.
Suppose the same conditions as in Theorem 1. Suppose, furthermore, that $k(\mathbf{x}, \mathbf{x}') \geq 0$ for all $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^d$. The variance of $\tilde{k}_{\mathrm{set}}(X, Y; \mathbf{w}, \pi, L)$ is bounded by a function of $m$, $L$ and $k_{\mathrm{set}}(X, Y)$:
$\mathrm{Var}\!\left[\tilde{k}_{\mathrm{set}}(X, Y; \mathbf{w}, \pi, L)\right] \leq \left( \frac{m^4}{L^4} - 1 \right) k_{\mathrm{set}}(X, Y)^2.$ (10)
The proofs of Lemma 2 and Theorems 1 and 2 are provided in the supplementary material.
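As a quick numerical sanity check of Theorems 1 and 2, one can draw uniformly random $L$-element subsets (the subset distribution over which the expectation is taken in the analysis above) and compare the empirical mean and variance of the estimator against $k_{\mathrm{set}}$ and the bound. This is a verification sketch with a nonnegative RBF inner kernel, not part of the method itself:

```python
import numpy as np

def set_kernel(X, Y, k):
    """Set kernel (3): mean of k over all element pairs."""
    return float(np.mean([[k(a, b) for b in Y] for a in X]))

def sampled_set_kernel(X, Y, L, rng, k):
    """One draw of the estimator: k_set on uniformly random L-element subsets."""
    Sx = X[rng.choice(len(X), L, replace=False)]
    Sy = Y[rng.choice(len(Y), L, replace=False)]
    return set_kernel(Sx, Sy, k)

rng = np.random.default_rng(0)
m, d, L = 6, 2, 3
X, Y = rng.normal(size=(m, d)), rng.normal(size=(m, d))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # nonnegative inner kernel
draws = [sampled_set_kernel(X, Y, L, rng, rbf) for _ in range(4000)]
kset = set_kernel(X, Y, rbf)
```

The empirical mean of `draws` matches `kset` (Theorem 1), while the empirical variance stays well below the Theorem 2 bound $(m^4/L^4 - 1)\,k_{\mathrm{set}}(X,Y)^2$.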
3.3 Bayesian Optimization over Sets
For BO over $\mathfrak{X}$, the goal is to identify the set $X^\star$ for which a given function $f$ is minimized. As shown in Algorithm 2, BO over sets follows similar steps as laid out in Section 2.1, except that it involves the space $\mathfrak{X}$ of set inputs and requires a surrogate function on $\mathfrak{X}$. As we have already indicated, we use a GP surrogate function, with prior covariance defined either by (3) or (5) and a Matérn 5/2 inner kernel $k$.
Using a GP model requires computation on the order of $\mathcal{O}(n^3)$ at the $n$th step of BO because the covariance matrix must be inverted. Compared to the $\mathcal{O}(n^2 m^2 d)$ complexity of computing the full version of the set kernel matrix, the complexity of computing the inverse is smaller if, roughly, $m^2 d > n$ (that is, computing the matrix can be as costly as, or more costly than, inverting it). Because BO is an efficient sampling-based global optimization method, $n$ is small and this situation is common. Therefore, the reduction proposed by the approximation in Section 3.1 can be effective in reducing the complexity of all steps of BO over sets.
In addition, since the Cholesky decomposition, rather than a matrix inverse, is widely used to compute the posterior mean and variance functions [24], the time complexity of dealing with the covariance matrix can be reduced. Still, if $m$ is relatively large, our approximation remains effective. In this paper, we compute the GP surrogate using the Cholesky decomposition. See the effects of $L$ in Section 4.
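The Cholesky-based computation can be sketched as follows; these are the standard GP posterior equations [24], written as a hypothetical helper rather than the authors' code:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(K, k_star, k_star_star, y, noise=1e-6):
    """GP posterior mean and variance at a single test input via a Cholesky factor.
    K: (n, n) train covariance, k_star: (n,) train/test covariances,
    k_star_star: prior variance at the test input, y: (n,) observations."""
    c_and_low = cho_factor(K + noise * np.eye(len(y)))  # O(n^3), done once per BO step
    alpha = cho_solve(c_and_low, y)                     # solves (K + noise I) alpha = y
    mean = float(k_star @ alpha)
    var = float(k_star_star - k_star @ cho_solve(c_and_low, k_star))
    return mean, max(var, 0.0)
```

The factorization is computed once per BO step and reused for every acquisition-function evaluation, which is what makes the $\mathcal{O}(n^2 m^2 d)$ kernel-matrix construction, rather than the linear algebra, the dominant cost when $m$ is large.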
4 Experiments
In this section, we present various experiments which show that our method can be employed in several applications. Before presenting those examples, we describe the baseline methods:
Vector A standard BO is performed over an $md$-dimensional space: at the $t$th step, each available set is vectorized to a point in $\mathbb{R}^{md}$ with its associated function value. At each step, the vectorized next location $\mathbf{x}_{t+1} \in \mathbb{R}^{md}$ is converted back into a set of $m$ points in $\mathbb{R}^d$.
Split Individual BO strategies are executed on the $m$ components comprising the sets. At the $t$th step, the available data is decomposed into $m$ sets of data, the $j$th of which consists of the $j$th elements $\mathbf{x}_j^{(i)}$ of the observed sets with their associated function values. The $m$ vectors produced during each step of these optimizations are then collected to form the set at which to evaluate $f$.
All code and scripts used to implement our methods will be released as open-source software. The full implementation details are provided in the supplementary material.
4.1 Set Kernel Approximation
We study the effect of $L$ on the set kernel approximation introduced in Section 3.1. Using a set of 1,000 instances generated from the standard normal distribution, we observe the effects of $L$ shown in Figure 2. The estimate $\tilde{k}_{\mathrm{set}}$ converges to the true value (see the supplementary material for the exact values) as $L$ is increased, and the variance of the estimate is large when $L$ is small, as discussed in Section 3.2. Moreover, the time consumed increases as $L$ is increased. We present the detailed results in the supplementary material.
4.2 Synthetic Functions
We test two synthetic circumstances to show that BO over sets is a valid approach to find an optimal set that minimizes an objective function $f$. In each setting, there is an auxiliary function $g : \mathbb{R}^d \to \mathbb{R}$, and the function $f$ is defined as the average of $g$ over the elements of the set:
$f(X) = \frac{1}{m} \sum_{i=1}^{m} g(\mathbf{x}_i).$ (11)
The functions $g$ (see the supplementary material) are designed to be multimodal, giving the opportunity for the set to contain values from each of the modes in the domain. Additionally, as expected, $f$ is permutation-invariant (any ordering imposed on the elements of $X$ is immaterial).
Synthetic 1 We choose $g$ to be a simple periodic function.
Synthetic 2 The function $g$ is the summation of normal probability density functions.
As shown in Figure 3, both of these circumstances have a clear multimodal structure, allowing optimal sets to contain points which are clustered in a single local minimum or spread throughout the domain in several local minima. The first three columns of Figure 3 show that the “Vector” and “Split” strategies have difficulty optimizing the functions in both circumstances. On the other hand, our proposed method finds optimal outcomes more effectively. (While not our concern here, it is possible that some amount of distance between points in the set would be desired; if that were the case, such a desire could be enforced in the function $f$.)
We study the impact of $L$ when optimizing these two synthetic functions; a smaller $L$ should yield faster computations, but also a worse approximation to the true matrix (when $L < m$). Due to the page limit, we present these results in Table 2 of the supplementary material. The convergence performance scales as expected, and the computational cost is affected by the order of the complexities. As mentioned in Section 3.3, there are situations where the complexity of inverting the covariance matrix is larger than the complexity of computing it.
4.3 Initialization of Clustering Methods
We initialize two clustering methods for a dataset $\mathcal{P}$ with BO over sets: (i) $k$-means clustering and (ii) the Gaussian mixture model (GMM). For these experiments, we add four additional baselines for the clustering algorithms (see the supplementary material for the additional baselines). To fit a dataset with those four baselines, we use the whole dataset without splitting.
For the two clustering algorithms, we generate a synthetic dataset $\mathcal{P}$. We split the dataset into training (70%) and test (30%) sets. In the BO setting, after finding the converged cluster centers with the training set, the adjusted Rand index (ARI) is computed on the test set. The algorithms are optimized over the set domain $\mathfrak{X}$. All clustering models are implemented using scikit-learn [23].
The function of interest in the $k$-means clustering setting is the converged clustering residual
$f(X) = \sum_{i=1}^{|\mathcal{P}|} \sum_{j=1}^{k} w_{ij} \, \| \mathbf{p}_i - \bar{\mathbf{x}}_j \|_2^2,$ (12)
where $X$ is the set of proposed initial cluster centers, $\bar{X} = \{\bar{\mathbf{x}}_1, \ldots, \bar{\mathbf{x}}_k\}$ is the set of converged cluster centers [19], and the $w_{ij}$ are softmax values computed from the pairwise distances. Here, the fact that $\bar{X}$ is a function of $X$ and $\mathcal{P}$ is omitted for notational simplicity. The set of converged cluster centers is determined through an iterative strategy which is highly dependent on the initial points to converge to effective centers. As shown in Figure 4(a), our method shows the best performance compared to the other baselines.
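The black-box evaluation can be sketched with scikit-learn by running $k$-means from the proposed initial centers. Here we return the converged inertia as the residual, a stand-in for the softmax-weighted loss in (12), so this is a simplified proxy rather than the authors' exact objective:

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_objective(init_centers, P, k):
    """Black-box f for BO over sets: run k-means from the proposed initial
    centers (one candidate set X) and return the converged residual.
    Inertia is used as a stand-in for the weighted loss in (12)."""
    km = KMeans(n_clusters=k, init=np.asarray(init_centers), n_init=1).fit(P)
    return float(km.inertia_)
```

BO then proposes a new set of $k$ initial centers at each step and observes this scalar, exactly matching the set-input formulation of (1).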
In contrast to $k$-means clustering, the GMM estimates the parameters of $k$ Gaussian distributions and the mixing weights between the distributions. Because it is difficult to directly minimize the negative log-likelihood of the observed data, we fit the GMM using the expectation-maximization (EM) algorithm [3]. Similarly to $k$-means clustering, this requires initial guesses of the cluster centers.
As shown in Figure 4(b), we conduct the experiments to initialize the mixture of Gaussians. Our method shows better performance than the other baselines except the “Vector” baseline. Because the “Vector” baseline finds an optimal set under the assumption that the structure of the sets is known, it does not directly match the problem of interest in this work, though it can still be considered as a baseline.
The computational cost in these experiments is presented as a function of $L$ (see the table in the corresponding section of the supplementary material). Because $m$ in this problem is larger than in Section 4.2, the computational cost follows our expectation, which implies that computing the covariance matrix over sets is the dominant computation in the overall procedure.
5 Conclusion
In this paper, we propose a BO method over sets, which takes a set as its input and produces a scalar output. Our method, based on GP regression, models a surrogate function using a set-taking covariance function, referred to as a set kernel. We approximate the set kernel with an efficient positive-definite kernel that is an unbiased estimator of the original set kernel. Our experimental results demonstrate that our method can be used in several novel applications of BO.
References
 [1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, Louisiana, USA, 2007.
 [2] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv e-prints, arXiv:1012.2599, 2010.
 [3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977.
 [4] H. Edwards and A. Storkey. Towards a neural statistician. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 2017.
 [5] G. E. Fasshauer and M. J. McCourt. Kernel-based Approximation Methods Using Matlab. World Scientific, 2015.
 [6] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, Australia, 2017.
 [7] P. I. Frazier. Bayesian optimization. In E. Gel and L. Ntaimo, editors, Recent Advances in Optimization and Modeling of Contemporary Problems, pages 255–278. INFORMS, 2018.
 [8] P. I. Frazier and J. Wang. Bayesian optimization for materials design. In Information Science for Materials Discovery and Design, pages 45–75. Springer, 2016.
 [9] M. Garnelo, D. Rosenbaum, C. J. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. J. Rezende, and S. M. A. Eslami. Conditional neural processes. In Proceedings of the International Conference on Machine Learning (ICML), pages 1690–1699, Stockholm, Sweden, 2018.
 [10] R. Garnett, M. A. Osborne, and S. J. Roberts. Bayesian optimization for sensor set selection. In ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), Stockholm, Sweden, 2010.
 [11] T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, Australia, 2002.
 [12] M. G. Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2:299–312, 2001.
 [13] J. González, J. Longworth, D. C. James, and N. D. Lawrence. Bayesian optimization for synthetic gene design. In Neural Information Processing Systems Workshop on Bayesian Optimization (BayesOpt), Montreal, Quebec, Canada, 2014.
 [14] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
 [15] N. Hansen. The CMA evolution strategy: A tutorial. arXiv e-prints, arXiv:1604.00772, 2016.
 [16] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz, 1999.
 [17] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the International Conference on Learning and Intelligent Optimization, Rome, Italy, 2011.
 [18] K. Kandasamy, W. Neiswanger, J. Schneider, B. Póczos, and E. P. Xing. Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, Montreal, Quebec, Canada, 2018.
 [19] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
 [20] J. Močkus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978.
 [21] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(12):1–141, 2017.
 [22] C. J. Paciorek and M. J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems (NeurIPS), volume 17, Vancouver, British Columbia, Canada, 2004.
 [23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [24] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 [25] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
 [26] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
 [27] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NeurIPS), volume 25, Lake Tahoe, Nevada, USA, 2012.
 [28] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, Long Beach, California, USA, 2017.
 [29] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 2017.
Supplementary Material: Bayesian Optimization over Sets
Appendix S.A Proof of Lemma 2
Proof.
We can rewrite the original summation in a slightly more convoluted form, as
$\sum_{i=1}^{M} \sum_{j=1}^{M} k_{\mathrm{set}}(\tilde{X}_i, \tilde{Y}_j) = \frac{1}{L^2} \sum_{a=1}^{m} \sum_{b=1}^{m} \left[ \sum_{i=1}^{M} \sum_{j=1}^{M} \sum_{c=1}^{L} \sum_{e=1}^{L} \mathbb{1}_{ab}^{ijce} \right] k(\mathbf{x}_a, \mathbf{y}_b),$ (s.1)
where $\mathbb{1}_{ab}^{ijce} = 1$ if $\tilde{\mathbf{x}}_c^{(i)} = \mathbf{x}_a$ and $\tilde{\mathbf{y}}_e^{(j)} = \mathbf{y}_b$, and 0 otherwise. As these are finite summations, they can be safely reordered.
The symmetry in the structure and evaluation of the summation implies that each quantity $\mathbf{x}_a$ will be paired with each quantity $\mathbf{y}_b$ the same number of times. Therefore, we need only consider the number of times that these quantities appear.
We recognize that this summation follows a pattern related to Pascal’s triangle. Among the $M$ possible subsets of $X$, only the fraction $L/m$ of those contain the quantity $\mathbf{x}_a$, for all $a \in \{1, \ldots, m\}$ (irrespective of how that entry may be denoted in the $\tilde{X}_i$ terminology). Because of the symmetry mentioned above, each of those quantities is paired with each of the $\mathbf{y}_b$ quantities the same number of times. This result implies that
$\sum_{i=1}^{M} \sum_{j=1}^{M} \sum_{c=1}^{L} \sum_{e=1}^{L} \mathbb{1}_{ab}^{ijce} = \left( \frac{L M}{m} \right)^2,$ (s.2)
for every pair $(a, b)$, with $\mathbb{1}_{ab}^{ijce}$ defined as in (s.1). Substituting (s.2) into the bracketed quantity in (s.1) above completes the proof. ∎
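The identity (9) can be verified numerically by brute-force enumeration of all subsets. This is a verification sketch (any positive-definite inner kernel works; an RBF is used here):

```python
import numpy as np
from itertools import combinations

def set_kernel(X, Y, k):
    """Set kernel (3): mean of k over all element pairs."""
    return float(np.mean([[k(a, b) for b in Y] for a in X]))

def lemma2_lhs(X, Y, L, k):
    """Left-hand side of (9): sum of k_set over all M^2 pairs of L-subsets."""
    subsets = lambda Z: [Z[list(c)] for c in combinations(range(len(Z)), L)]
    return sum(set_kernel(Sx, Sy, k) for Sx in subsets(X) for Sy in subsets(Y))
```

Since the right-hand side of (9) equals $M^2$ times $k_{\mathrm{set}}(X, Y)$, averaging the left-hand side over the $M^2$ subset pairs recovers the full set kernel exactly, which is the content of Theorem 1.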
Appendix S.B Proofs of Set Kernel Estimator Properties
We start by introducing the notation $W$ and $\Pi$ for random variables such that $W \sim \mathcal{N}(\mathbf{0}, \mathsf{I}_d)$ and $\Pi$ is a uniformly random permutation of the integers between 1 and $m$. These are the distributions defining the $\mathbf{w}$ and $\pi$ quantities described above. With this, we note that $\tilde{k}_{\mathrm{set}}(X, Y; W, \Pi, L)$ is a random variable.
We also introduce the notation $\mathcal{S}_L(X)$ for the distribution of random subsets of $X$ with $L$ elements selected without replacement, the outcome of the subset selection from Section 3.1. This notation allows us to write the quantities
$\mu_{\tilde{k}} = \mathbb{E}\!\left[ \tilde{k}_{\mathrm{set}}(X, Y; W, \Pi, L) \right],$ (s.3)
$\sigma_{\tilde{k}}^2 = \mathrm{Var}\!\left[ \tilde{k}_{\mathrm{set}}(X, Y; W, \Pi, L) \right],$ (s.4)
for $L \leq m$. We have dropped the random variables from the expectation and variance definitions for ease of notation.
s.b.1 Proof of Theorem 1
Proof.
Our goal is to show that $\mu_{\tilde{k}} = k_{\mathrm{set}}(X, Y)$, where $\mu_{\tilde{k}}$ is defined in (s.3).
We first address an extreme case: $L = m$. If $L = m$, the subsets we are constructing are the full sets, i.e., $\mathcal{S}_m(X)$ contains only one element, $X$ itself. Thus, $\tilde{k}_{\mathrm{set}}(X, Y; W, \Pi, m) = k_{\mathrm{set}}(X, Y)$ is not a random variable.
For $L < m$, we compute this expected value from the definition (with some abuse of notation):
$\mu_{\tilde{k}} = \sum_{\tilde{X} \in \mathcal{S}_L(X)} \sum_{\tilde{Y} \in \mathcal{S}_L(Y)} p(\tilde{X}, \tilde{Y}) \, k_{\mathrm{set}}(\tilde{X}, \tilde{Y}).$ (s.5)
There are $M = \binom{m}{L}$ subsets, all of which could be indexed (arbitrarily) as $\tilde{X}_i$ for $i \in \{1, \ldots, M\}$. The probability mass function is uniform across all subsets, meaning that $p(\tilde{X}_i, \tilde{Y}_j) = 1/M^2$. Using this, we know
$\mu_{\tilde{k}} = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} k_{\mathrm{set}}(\tilde{X}_i, \tilde{Y}_j).$ (s.6)
We apply (3) to see that
$k_{\mathrm{set}}(\tilde{X}_i, \tilde{Y}_j) = \frac{1}{L^2} \sum_{a=1}^{L} \sum_{b=1}^{L} k(\tilde{\mathbf{x}}_a^{(i)}, \tilde{\mathbf{y}}_b^{(j)}),$ (s.7)
following the notational conventions used above. The expectation involves four nested summations,
$\mu_{\tilde{k}} = \frac{1}{M^2 L^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \sum_{a=1}^{L} \sum_{b=1}^{L} k(\tilde{\mathbf{x}}_a^{(i)}, \tilde{\mathbf{y}}_b^{(j)}).$ (s.8)
We utilize Lemma 2 to rewrite this as
$\mu_{\tilde{k}} = \frac{1}{M^2 L^2} \cdot \frac{L^2 M^2}{m^2} \sum_{a=1}^{m} \sum_{b=1}^{m} k(\mathbf{x}_a, \mathbf{y}_b) = k_{\mathrm{set}}(X, Y).$ (s.9)
∎
s.b.2 Proof of Theorem 2
Proof.
The variance of $\tilde{k}_{\mathrm{set}}(X, Y; W, \Pi, L)$, defined in (s.4), is computed as
$\sigma_{\tilde{k}}^2 = \mathbb{E}\!\left[ \tilde{k}_{\mathrm{set}}(X, Y)^2 \right] - \left( \mathbb{E}\!\left[ \tilde{k}_{\mathrm{set}}(X, Y) \right] \right)^2 = \mathbb{E}\!\left[ \tilde{k}_{\mathrm{set}}(X, Y)^2 \right] - k_{\mathrm{set}}(X, Y)^2,$ (s.10)
where Theorem 1 is invoked to produce the final equality. Using (s.6) and (s.7), we can express the first term of (s.10) as
$\mathbb{E}\!\left[ \tilde{k}_{\mathrm{set}}(X, Y)^2 \right] = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \left( \frac{1}{L^2} \sum_{a=1}^{L} \sum_{b=1}^{L} k(\tilde{\mathbf{x}}_a^{(i)}, \tilde{\mathbf{y}}_b^{(j)}) \right)^2.$ (s.11)
At this point, we invoke the fact that $k(\mathbf{x}, \mathbf{x}') \geq 0$ to state
$\sum_{a=1}^{L} \sum_{b=1}^{L} k(\tilde{\mathbf{x}}_a^{(i)}, \tilde{\mathbf{y}}_b^{(j)}) \leq \sum_{a=1}^{m} \sum_{b=1}^{m} k(\mathbf{x}_a, \mathbf{y}_b),$ (s.12)
which is true because the summation over $m^2$ terms contains all of the elements of the summation over $L^2$ terms, as well as other (nonnegative) elements. Using this, we can bound (s.11) by
$\mathbb{E}\!\left[ \tilde{k}_{\mathrm{set}}(X, Y)^2 \right] \leq \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \left( \frac{1}{L^2} \sum_{a=1}^{m} \sum_{b=1}^{m} k(\mathbf{x}_a, \mathbf{y}_b) \right)^2 = \frac{m^4}{L^4} k_{\mathrm{set}}(X, Y)^2.$ (s.13)
Therefore, with (s.13), (s.10) can be written as
$\sigma_{\tilde{k}}^2 \leq \left( \frac{m^4}{L^4} - 1 \right) k_{\mathrm{set}}(X, Y)^2,$ (s.14)
which concludes this proof. ∎
The restriction $k(\mathbf{x}, \mathbf{x}') \geq 0$ is satisfied by many standard covariance kernels (such as the Gaussian, the Matérn family and the multiquadric) as well as some more interesting choices (such as the Wendland or Wu families of compactly supported kernels). It does, however, exclude some oscillatory kernels such as the Poisson kernel, as well as kernels defined implicitly which may have oscillatory behavior. More discussion on different types of kernels and their properties can be found in the kernel literature [5].
Appendix S.C Experiments
In this section, we describe the detailed settings, baselines, and results that are not covered in the main article.
We use Gaussian process regression [24] with a set kernel or a Matérn 5/2 kernel as the surrogate function. Because computing the inverse of the covariance matrix is computationally heavy, we use the Cholesky decomposition instead [24]. For the experiments with the set kernel, the Matérn 5/2 kernel is used as the inner kernel. All Gaussian process regression models are optimized by marginal likelihood maximization with the BFGS optimizer to find kernel hyperparameters. As the acquisition function, we use the expected improvement criterion [20] for all experiments. CMA-ES [15] is applied in optimizing the acquisition function. Furthermore, five initial points are given to start a single round of Bayesian optimization. Unless otherwise specified, most experiments are repeated 10 times. All reported execution times include the time consumed in evaluating the true function.
All results are computed on CPUs. The experiments for the GMM are run on an Intel Xeon CPU E5-1620 3.50GHz, and the other experiments on an Intel Core i7-5930K 3.50GHz. All implementations, which will be released as an open-source project, are written in Python. Thanks to [23], we use scikit-learn in many parts of our implementation. Confidence intervals are omitted from all figures for simplicity.
s.c.1 Effects of Set Kernel Approximation
(Table 1: estimated set kernel value and computation time in seconds as functions of $L$.)
We use the Matérn 5/2 kernel as the inner kernel. Table 1 shows the effects of $L$ for the set kernel. As mentioned in the main article, as $L$ is increased, the estimated value converges to the true value and the execution time increases.
s.c.2 Synthetic Functions
We define our synthetic functions as
Synthetic 1 We consider a function $g$ which is a simple periodic function
(s.15) 
Synthetic 2 We consider a function $g$ which is the summation of normal probability density functions
(s.16) 
where $p$ is the normal density function with the parameters depicted in Figure 3.
(Table 2: minimum function value and execution time (sec.) on Synthetic 1 and Synthetic 2 as functions of $L$.)
Table 2 presents the convergence quality and execution time for the synthetic functions defined in this work. Because of the order of the complexities induced by the various computations, the results on computational cost saturate for relatively large $L$. See the main article for the detailed explanations.
s.c.3 Initialization of Clustering Methods
The four baselines that are additionally used in initializing clustering methods are defined as below:
Random This baseline randomly draws $k$ points from a compact space $\mathcal{X}$.
Data This baseline randomly samples $k$ points from the dataset $\mathcal{P}$. It is widely used in initializing clustering algorithms.
($k$-means only) $k$-means++ [1] This is a method for $k$-means clustering built on the intuition that spreading out the initial cluster centers is better than the “Data” baseline (see [1] for details).
(GMM only) $k$-means This baseline sets the initial cluster centers to the result of $k$-means clustering.
(Table: ARI and execution time (sec.) for the $k$-means clustering and Gaussian mixture model experiments.)