Bayesian Optimization over Sets

Bayesian Optimization over Sets

Jungtaek Kim
Pohang, Republic of Korea
Michael McCourt
San Francisco, USA
Tackgeun You
Pohang, Republic of Korea
Saehoon Kim
Seoul, Republic of Korea
Seungjin Choi
Pohang, Republic of Korea

We propose a Bayesian optimization method over sets, to minimize a black-box function that can take a set as single input. Because set inputs are permutation-invariant and variable-length, traditional Gaussian process-based Bayesian optimization strategies which assume vector inputs can fall short. To address this, we develop a Bayesian optimization method with set kernel that is used to build surrogate functions. This kernel accumulates similarity over set elements to enforce permutation-invariance and permit sets of variable size, but this comes at a greater computational cost. To reduce this burden, we propose a more efficient probabilistic approximation which we prove is still positive definite and is an unbiased estimator of the true set kernel. Finally, we present several numerical experiments which demonstrate that our method outperforms other methods in various applications.


Bayesian Optimization over Sets

  Jungtaek Kim POSTECH Pohang, Republic of Korea Michael McCourt SigOpt San Francisco, USA Tackgeun You POSTECH Pohang, Republic of Korea Saehoon Kim AITRICS Seoul, Republic of Korea Seungjin Choi POSTECH Pohang, Republic of Korea


noticebox[b]Preprint. Under review.\end@float

1 Introduction

Bayesian optimization (BO) is an effective method to optimize a black-box function which is expensive to evaluate. It has proven useful in several applications, including hyperparameter optimization [27, 17], neural architecture search [29, 18], material design [8], and synthetic gene design [13]. Classic BO assumes that a search region is defined and that the black-box function can only produce scalar output in the presence of additive noise , i.e., for .

Unlike this standard BO formulation, in this article we assume that our search region is for a fixed positive integer . Thus, for , would take in a set containing elements, all of length , and return a noisy function value :


Our motivating example comes from the soft -means clustering algorithm over a dataset ; in particular, we want to find the optimal initialization of such an algorithm. The objective function for this problem is a squared loss function which takes in the cluster initialization points and returns the weighted distance between the points in and the converged cluster centers . See [19] for more details.

Some previous research has tried to build Gaussian process (GP) models on set data. [10] proposes a method over discrete sets using stationary kernels over the first Wasserstein distance between two sets, though the power set of fixed discrete sets as domain space is not our interest. However, this method needs the complexity , to compute a covariance matrix with respect to sets. Moreover, because it only considers stationary kernels, GP regression is restricted to the form that cannot express non-stationary models [22]. More detailed explanations can be found in Section 2.3.

Therefore, we instead adapt and augment a strategy proposed by [11] involving the creation of a specific set kernel. This set kernel uses a kernel defined on the elements of the sets to build up its own sense of covariance between sets. In turn, then, it can be directly used to build surrogate functions through GP regression, which then can power BO, by Lemma 1.

A key contribution of this article is the development of a computationally efficient approximation to this set kernel. Given total observed function values, the cost of constructing the matrix required for fitting the GP is where approximately (see the complexity analysis in Section 3.3). We propose the use of random subsampling to reduce the computational cost to for while still producing an unbiased estimate of the expected value of the true kernel.

We summarize our contributions as follows.

  • We propose Bayesian optimization over sets, using a set kernel to define covariance for our Gaussian process-based surrogate function.

  • We produce a novel estimator of the set kernel, prove key properties about the bias and variance of this estimator, and analyze its superior computational efficiency relative to the true set kernel.

  • We empirically show our estimator retains the necessary theoretical properties enjoyed by the original kernel to effectively perform in our application.

2 Background

In this section, we introduce ideas and notations necessary to understand BO, as well as the notations and previous research on kernels for sets.

2.1 Bayesian Optimization

BO seeks to minimize an unknown function which is expensive to evaluate:


where is a compact space. It is a sequential optimization strategy which, at each iteration, performs the following three computations:

  1. Using the pieces of data presently available, for , build a probabilistic surrogate model meant to approximate .

  2. Using the surrogate model, compute an acquisition function , which represents the utility of next acquiring data at some new point .

  3. Observe from a true function at the location .

After exhausting a predefined observation budget , BO returns the best point that has the minimum observation. The benefit of this process is that the optimization of the expensive function has been replaced by the optimization of much cheaper and more well-understood acquisition functions .

In this paper, we use GP regression [24] to produce the surrogate function ; from , we use the standard acquisition function expected improvement [20]: , where . See [2, 26, 7] for further details on BO.

2.2 Set Kernel

We start by introducing notation which is required for performing kernel approximation of functions on set data. A set of vectors is denoted . In a collection of such sets (as will occur in the BO setting), the th set would be denoted . Note that we are restricting all sets to be of the same size here.111In principle, sets of varying size can be considered, but we make the restriction to same sized sets to simplify our analysis.

To build a GP surrogate, we require a prior belief of the covariance between elements in . This belief is imposed in the form of a positive-definite covariance kernel ; see [25, 5] for more discussion on approximation with kernels. In addition to the symmetry required in standard kernel settings, kernels on sets require an additional property. The ordering of elements in should be immaterial (since sets have no inherent ordering).

Given an empirical approximation of the kernel mean where is a basis function and is a dimension of projected space by , a set kernel [11, 21] is defined as


since . Here, is a positive-definite kernel defined to measure the covariance between the -dimensional elements of the sets (e.g., a squared exponential or Matérn kernel).

The kernel (3) arises when comparing a class of functions on different probability measures with the intent of understanding if the measures might be equal [14].

Figure 1: Illustration that shows how to select instances from sets, which originally have instances. In this example, and . (Phase 1) Two set inputs are projected by a vector that is randomly drawn from the standard Gaussian distribution. The points that have same color belong to same set (e.g., blue and red). (Phase 2) On the line that the points projected are located, the order of instances is determined. (Phase 3) Using the order of instances, two instances uniformly sampled are selected and they are used to compute the approximated set kernel value.

2.3 Related Works

Although it has been raised in different interests, meta-learning approaches dealt with set inputs are promising in machine learning community, because they can generalize distinct tasks with meta-learners [4, 28, 6, 9]. In particular, [4, 28, 9] propose feed-forward neural networks which take permutation-invariant and variable-length inputs: they have the goal of obtaining features derived from the sets with which to input to a standard (meta-)learning routine. Because they consider modeling of set data, they are related to our work, but they are interested in their own specific examples such as point cloud classification, few-shot learning, and image completion.

In BO, [10] suggests a method to find a set that produces a global minimum with respect to discrete sets, each of which is an element of power set of entire set. Because the time complexity of the first Wasserstein distanace is , they assume a small cardinality of sets and discrete searching space for global optimization method. Furthermore, their method restricts the number of iterations for optimizing an acquisition function, since by the curse of dimensionality the number of iterations should be increased exponentially. However, it implies that the global solution of acquisition function is hard to be found.

Compared to [10], we consider continuous domain space which implies an acquired set can be composed of any instances in a compact space . We thus freely use off-the-shelf global optimization method or multi-started local optimization method [26] with relatively large number of instances in sets. In addition, its structure of kernel is where is a stationary kernel [12] and is some distance function over two sets (e.g., in [10] the first Wasserstein distance). Using the method proposed in the subsequent section, non-stationary kernels might be considered in modeling a surrogate function.

3 Proposed Method

We first propose and analyze an approximation to the set kernel (3) for GP regression in this section. Then, we present our BO method over sets.

In order for (3) to be a viable covariance kernel of a GP regression, it must be positive-definite. To discuss this topic, we denote a list of sets with the notation ; in this notation, the order of the entries matters.

Lemma 1.

Suppose we have a list which contains distinct sets for . We define the matrix as


for defined with a chosen inner kernel as in (3). Then, is a symmetric positive-semidefinite matrix if is a symmetric positive-definite kernel.

This proof appears in [16, Lemma 1], and is also discussed in [11].

3.1 Approximation of the Set Kernel

Computing (4) requires pairwise comparisons between all sets present in , which has computational complexity . To alleviate this cost, we propose to approximate (3) with


where , and and is a subset of which is defined by those three quantities (we omit explicitly listing them in to ease the notation).

The goal of the approximation strategy is to convert from (of size ) to (of size ) in a consistent fashion during all the computations comprising . We accomplish this in two steps:

  1. Use a randomly generated vector to impose an (arbitrary) ordering of the elements of all sets , and

  2. Randomly permute the indices via a function .

These random strategies are defined once before computing the matrix, and then used consistently throughout the entire computation.

To impose an ordering of the elements, we use a random scalar projection such that the elements of are drawn from the standard normal distribution. If the scalar projections of each are computed, this produces the set of scalar values , which can be sorted to generate an ordered list of


for an ordering of distinct indices . Ties between values can be dealt with arbitrarily.

The function then is simply a random bijection of the integers onto themselves. Using this, we can sample vectors from :


This process, given , and is sufficient for computing .

3.2 Properties of the Approximation

The covariance matrix for this approximation of the set kernel, which we denote by , should approximate the full version of covariance matrix, from (4). Because of the random structure introduced in Section 3.1, the matrix will be random. This will be addressed in Theorem 1, but for now, represents a single realization of that random variable, not the random variable itself. To be viable, this approximation must satisfy the following requirements:

Property 1.

The approximation satisfies pairwise symmetry:


Because is uniquely defined given , this simplifies to , which is true because is symmetric.

Property 2.

The “ordering” of the elements in the sets should not matter when computing . Indeed, because (6) enforces ordering based on , and not whatever arbitrary indexing is imposed in defining the elements of the set, the kernel will be permutation invariant.

Property 3.

The kernel approximation (5) reduces to computing on a lower cardinality version of the data (with elements selected from ). Because is positive-definite on these -element sets, we know that is also positive-definite.

Property 4.

Since the approximation method aims to choose subsets of input sets, the computational cost becomes lower than the original formulation.

Missing from these four properties is a statement regarding the quality of the approximation. We address this in Theorem 1, though we first start by stating Lemma 2. Moreover, using Theorem 1, we can also obtain the variance of our estimate, as shown in Theorem 2.

0:  A list of sets , .
0:  A matrix .
1:  Draw a random vector from .
2:  Create a random permutation of the integers with which to define .
3:  Using , assign the ordering of elements in according to (6).
4:  Using , determine the subsets as selected according to (7).
5:  Using these subsets and (5), populate the matrix with values .
Algorithm 1 Forming the approximation to
Lemma 2.

Suppose there are two sets . Without loss of generality, let and denote the th and th of possible subsets containing elements of and , respectively, in an arbitrary ordering. For ,


where and are the th and th elements of and , respectively, in an arbitrary ordering.

Theorem 1.

Suppose that we are given two sets and . Suppose, furthermore, that and can be generated randomly as defined in Section 3.1 to form subsets and . The value of is an unbiased estimator of the value of .

Theorem 2.

Suppose the same conditions as in Theorem 1. Suppose, furthermore, that for all . The variance of is bounded by a function of , and :


To accommodate the page limit, the proofs of Lemma 2, Theorem 1 and Theorem 2 are provided in the supplementary material. ∎

By Theorem 2, we can naturally infer the fact that the upper bound of the variance is small, if is close to . This result is confirmed in Section 4.1.

3.3 Bayesian Optimization over Sets

For BO over , the goal is to identify the set such that a given function is minimized. As shown in Algorithm 2, BO over sets follows similar steps as laid out in Section 2.1, except that it involves the space of set inputs and requires a surrogate function on . As we have already indicated, we plan to use a GP surrogate function, with prior covariance defined either with (3) or (5) and a Matérn 5/2 inner kernel .

Using a GP model requires computation on the order of at the th step of the BO because the matrix must be inverted. Compared to the complexity for computing a full version of the set kernel , the complexity of computing the inverse is smaller if roughly (that is, computing the matrix can be as costly or more costly than inverting it). Because BO is efficient sampling-based global optimization, is small and the situation is reasonable. Therefore, the reduction proposed by the approximation in Section 3.1 can be effective in reducing complexity of all steps for BO over sets.

In addition, since the Cholesky decomposition, instead of matrix inverse is widely used to compute posterior mean and variance functions [24], the time complexity for inverting a covariance matrix can be reduced. But still, if is relatively large, our approximation is effective. In this paper, we compute GP surrogate using the Cholesky decomposition. See the effects of in Section 4.

0:  A domain , a function , a budget .
0:  Best acquired set
1:  Choose an initial point randomly from and evaluate .
2:  for  from 1 to  do
3:     Fit the surrogate model to all available data .
4:     Compute the acquisition function from .
5:     Identify .
6:     Evaluate .
7:  end for
8:  return if .
Algorithm 2 Bayesian Optimization over Sets

4 Experiments

Figure 2: Results on the effects of for set kernels. The mean and standard deviation of of each are plotted, computed over 10 trials.

In this section, we present various experiments to show our method can be employed in the applications. Before showing those examples, we describe the baseline methods:

Vector A standard BO is performed over a -dimensional space where, at the th step, the available data is vectorized to for with associated function values. At each step, the vectorized next location is converted into a set .

Split Individual BO strategies are executed on the components comprising . At the th step, the available data is decomposed into sets of data, the th of which consists of with associated data. The vectors produced during each step of the optimization are then collected to form at which to evaluate .

All codes and scripts used to implement our methods will be released as open source software. The full implementation details are provided in the supplementary material.

4.1 Set Kernel Approximation

We study the effect of for the set kernels introduced in Section 3.1. Using a set generated from standard normal distribution, which has 1,000 -dimensional instances, we observe the effects of as shown in Figure 2. converges to the value (see the supplementary material for the exact values) as is increased, and the variance of value is large when is small, as discussed in Section 3.2. Moreover, the consumed time is increased as is increased. We present the detailed results in the supplementary material.

4.2 Synthetic Functions

(a) Vector
(b) Split
(c) Ours (w/o appr.)
(d) Synthetic 1
(e) Vector
(f) Split
(g) Ours (w/o appr.)
(h) Synthetic 2
Figure 3: Results on two synthetic functions. First three columns depict one of the best acquisition examples (i.e., purple stars indicate instances in the acquired set) via three methods: (i) Vector, (ii) Split, and (iii) Ours (w/o approximation). The last column plots minimum function value over iterations. For Synthetic 1 (first row) and Synthetic 2 (second row), . All experiments are repeated 10 times. Because the experiments with approximation show similar results with the one without approximation, we omit those acquisition examples.

We test two synthetic circumstances to show BO over sets is a valid approach to find an optimal set that minimizes an objective function . In each setting, there is an auxiliary function , and the function is defined as


The functions (see the supplementary material) are designed to be multimodal, giving the opportunity for the set to contain values from each of the modes in the domain. Additionally, as is expected, is permutation invariant (any ordering imposed on the elements of is immaterial).

Synthetic 1 We consider , and choose to be a simple periodic function.

Synthetic 2 The function is the summation of probability density functions, where , .

As shown in Figure 3, both of these circumstances have a clear multimodal structure, allowing for optimal sets to contain points which are clustered in a single local minima or to be spread out through the domain in several local minima. The first three columns of Figure 3, show that the “Vector” and “Split” strategies have difficulty optimizing functions in both circumstances. On the other hand, our proposed method finds optimal outcomes more effectively.222 While not our concern here, it is possible that some amount of distance between points in the set would be desired. If that were the case, such a desire could be enforced in the function .

We study the impact of when optimizing these two synthetic functions; a smaller should yield faster computations, but also a worse approximation to the true matrix (when ). Due to the page limit, we present these results in Table 2 located in the supplementary material. The convergence performance does scale as expected, and the computational cost is hurt by the order of the complexities. As we mentioned in Section 3.3, there exists a situation that the complexity for inverting a covariance matrix can be larger than the complexity for computing the covariance matrix.

4.3 Initialization of Clustering Methods

(a) -means
(b) GMM
Figure 4: Results on initializing -means clustering and GMM are averaged over 10 trials. To compare with the baselines fairly, Random, Data, -means++ [1], -means are run 1,000 times.

We initialize two clustering methods for dataset with BO over sets: (i) -means clustering, and (ii) Gaussian mixture model (GMM). For these experiments, we add four additional baselines for clustering algorithms (see the supplementary material for the additional baselines). To fit a dataset with those four baselines, we use the whole dataset without splitting.

For two clustering algorithms, we generate a dataset where , , and . We split the dataset to training (70%) and test (30%) datasets. In BO settings, after finding the converged cluster centers with training dataset, the adjusted Rand index (ARI) is computed by test dataset. The algorithms are optimized over . All clustering models are implemented using scikit-learn [23].

The function of interest in the -means clustering setting is the converged clustering residual


where is the set of proposed initial cluster centers, is the set of converged cluster centers [19], and are softmax values from the pairwise distances. Here, the fact that is a function of and is omitted for notational simplicity. The set of converged cluster centers is determined through an iterative strategy which is highly dependent on the initial points to converge to effective centers. As shown in Figure 4(a), our method with shows the best performance compared to other baselines.

In contrast to -means clustering, the GMM estimates parameters of Gaussian distributions and mixing parameters between the distributions. Because it is difficult to minimize negative log-likelihood of the observed data, we fit the GMM using expectation-maximization (EM) algorithm [3]. Similarly to -means clustering, this requires initial guesses to converge to cluster centers .

As shown in Figure 4(b), we conduct the experiments to initialize the mixture of Gaussians. Our method shows better performance than other baselines except the “Vector” baseline. Because the “Vector” baseline finds an optimal set under the assumption that we know the structure of sets, it does not directly match to the problem interested in this work, though it can be considered as the baseline.

The computational cost in these experiments is presented as a function of (see the table in the corresponding section of the supplementary material). Because in this problem is larger than Section 4.2, the computational cost follows our expectation, which implies that the complexity for computing a covariance matrix over sets is the major computations in the overall procedure.

5 Conclusion

In this paper, we propose the BO method over sets, which takes a set as an input and produces a scalar output. Our method based on GP regression models a surrogate function using set-taking covariance functions, referred to as set kernel. We approximate the set kernel to the efficient positive-definite kernel that is an unbiased estimator of the original set kernel. Our experimental results demonstrate our method can be used in some novel applications for BO.


  • [1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, Louisiana, USA, 2007.
  • [2] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv e-prints, arXiv:1012.2599, 2010.
  • [3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977.
  • [4] H. Edwards and A. Storkey. Towards a neural statistician. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 2017.
  • [5] G. E. Fasshauer and M. J. McCourt. Kernel-based Approximation Methods Using Matlab. World Scientific, 2015.
  • [6] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, Australia, 2017.
  • [7] P. I. Frazier. Bayesian optimization. In E. Gel and L. Ntaimo, editors, Recent Advances in Optimization and Modeling of Contemporary Problems, pages 255–278. INFORMS, 2018.
  • [8] P. I. Frazier and J. Wang. Bayesian optimization for materials design. In Information Science for Materials Discovery and Design, pages 45–75. Springer, 2016.
  • [9] M. Garnelo, D. Rosenbaum, C. J. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. J. Rezende, and S. M. A. Eslami. Conditional neural processes. In Proceedings of the International Conference on Machine Learning (ICML), pages 1690–1699, Stockholm, Sweden, 2018.
  • [10] R. Garnett, M. A. Osborne, and S. J. Roberts. Bayesian optimization for sensor set selection. In ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), Stockholm, Sweden, 2010.
  • [11] T. Gätner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, Australia, 2002.
  • [12] M. G. Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2:299–312, 2001.
  • [13] J. González, J. Longworth, D. C. James, and N. D. Lawrence. Bayesian optimization for synthetic gene design. In Neural Information Processing Systems Workshop on Bayesian Optimization (BayesOpt), Montreal, Quebec, Canada, 2014.
  • [14] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
  • [15] N. Hansen. The CMA evolution strategy: A tutorial. arXiv e-prints, arXiv:1604.00772, 2016.
  • [16] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz, 1999.
  • [17] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the International Conference on Learning and Intelligent Optimization, Rome, Italy, 2011.
  • [18] K. Kandasamy, W. Neiswanger, J. Schneider, B. Póczos, and E. P. Xing. Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, Montreal, Quebec, Canada, 2018.
  • [19] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  • [20] J. Močkus, V. Tiesis, and A. Źilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978.
  • [21] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
  • [22] C. J. Paciorek and M. J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems (NeurIPS), volume 17, Vancouver, British Columbia, Canada, 2004.
  • [23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [24] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • [25] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
  • [26] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
  • [27] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NeurIPS), volume 25, Lake Tahoe, Nevada, USA, 2012.
  • [28] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, Long Beach, California, USA, 2017.
  • [29] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 2017.

Supplementary Material: Bayesian Optimization over Sets

Appendix S.A Proof of Lemma 2


We can rewrite the original summation in a slightly more convoluted form, as


where if and , and 0 otherwise. As these are finite summations, they can be safely reordered.

The symmetry in the structure and evaluation of the summation implies that as each quantity will be paired with each quantity the same number of times. Therefore, we need only consider the number of times that these quantities appear.

We recognize that this summation follows a pattern related to Pascal’s triangle. Among the possible subsets of , only the fraction of those contain the quantity for all (irrespective of how that entry may be denoted in terminology). Because of the symmetry mentioned above, each of those quantities is paired with each of the quantities the same number of times. This result implies that


where if and , and 0 otherwise. Substituting (s.2) into the bracketed quantity in (s.1) above completes the proof. ∎

Appendix S.B Proofs of Set Kernel Estimator Properties

We start by introducing the notation and to be random variables such that and is a uniformly random permutation of the integers between 1 and . These are the distributions defining the and quantities described above. With this, we note that is a random variable.

We also introduce the notation to be the distribution of random subsets of with elements selected without replacement, the outcome of the subset selection from Section 3.1. This notation allows us to write the quantities


for . We have dropped the random variables from the expectation and variance definitions for ease of notation.

s.b.1 Proof of Theorem 1


Our goal is to show that , where is defined in (s.3).

We first introduce an extreme case: . If , the subsets we are constructing are the full sets, i.e., contains only one element, . Thus, is not a random variable.

For , we compute this expected value from the definition (with some abuse of notation):


There are subsets, all of which could be indexed (arbitrarily) as for . The probability mass function is uniform across all subsets, meaning that . Using this, we know


We apply (3) to see that


following the notational conventions used above. The expectation involves four nested summations,


We utilize Lemma 2 to rewrite this as


s.b.2 Proof of Theorem 2


The variance of , defined in (s.4), is computed as


where Theorem 1 is invoked to produce the final line. Using (s.6) and (s.7), we can express the first term of (s.10) as


At this point, we invoke the fact that to state


which is true because the summation to terms contains all of the elements in the summation to terms, as well as other (nonnegative) elements. Using this, we can bound (s.11) by


Therefore, with (s.13), (s.10) can be written as


which concludes this proof. ∎

The restriction is satisfied by many standard covariance kernels (such as the Gaussian, the Matérn family and the multiquadric) as well as some more interesting choices (such as the Wendland or Wu families of compactly supported kernels). It does, however, exclude some oscillatory kernels such as the Poisson kernel as well as kernels defined implicitly which may have an oscillatory behavior. More discussion on different types of kernels and their properties can be found in the kernel literature [5].

Appendix S.C Experiments

In this section, we describe the detailed settings, baselines, and results that are not covered in the main article.

We use Gaussian process regression [24] with a set kernel or Matérn 5/2 kernel as a surrogate function. Because computing the inverse of covariance matrix needs heavy computations, we use the Cholesky decomposition instead [24]. For the experiments with set kernel, Matérn 5/2 kernel is used as a base kernel. All Gaussian process regression models are optimized by marginal likelihood maximization with the BFGS optimizer, to find kernel hyperparameters. For an acquisition function, we use expected improvement criterion [20] for all experiments. CMA-ES [15] is applied in optimizing the acquisition function. Furthermore, five initial points are given to start a single round of Bayesian optimization. Unless otherwise specified, most of experiments are repeated 10 times. For the execution time results, all the results include the time consumed in evaluating a true function.

All results are run via the CPU computations. The experiments for GMM are run by Intel Xeon CPU E5-1620 3.50GHz, and other experiments are run by Intel Core i7-5930K 3.50GHz. All implementations which will be released as open source project are written in Python. Thanks to [23], we use scikit-learn in many parts of our implementations. The confidence interval for all experiments is omitted for the simplicity of the figures.

s.c.1 Effects of Set Kernel Approximation

Time (sec.)
Table 1: The effects of for set kernels. All settings follows the settings in Figure 2. The numerical results are rounded to the three decimals, to show the effects precisely.

We use Matérn 5/2 kernel as a base kernel. Table 1 shows the effects of for set kernels. As we mentioned in the main article, as is increased, value is converged to the true value and execution time is increased.

s.c.2 Synthetic Functions

We define our synthetic functions as

Synthetic 1 We consider , and a function which is a simple periodic function


Synthetic 2 We consider , and a function which is the summation of probability density functions


where is the normal density function with depicted in Figure 3 and .

Synthetic 1 Synthetic 2
Min. function value Time ( sec.) Min. function value Time ( sec.)
Table 2: Convergence quality and its execution time on the synthetic functions. All settings follow the settings in Figure 3.

Table 2 represents a convergence quality and its execution time for the synthetic functions defined in this work. Because of the order of the complexities induced from various computations, the results on computational cost are saturated from relatively large (e.g., ). See the main article for the detailed explanations.

s.c.3 Initialization of Clustering Methods

The four baselines that are additionally used in initializing clustering methods are defined as below:

Random This baseline randomly draws points from a compact space .

Data This baseline randomly samples points from a dataset . It is widely used in initializing a clustering algorithm.

(-means only) -means++ [1] This is a method for -means clustering with the intuition that spreading out initial cluster centers is better than the “Data” baseline (see [1] for the details).

(GMM only) -means This baseline sets initial cluster centers as the results of -means clustering.

-means clustering Gaussian mixture model
ARI Time ( sec.) ARI Time ( sec.)