On Linear Convergence of Weighted Kernel Herding

On Linear Convergence of Weighted Kernel Herding

Abstract

We provide a novel convergence analysis of two popular sampling algorithms, Weighted Kernel Herding and Sequential Bayesian Quadrature, that are used to approximate the expectation of a function under a distribution. Previous theoretical analysis was insufficient to explain the empirical successes of these algorithms. We improve upon previous convergence rates to show that, under mild assumptions, these algorithms converge linearly. To this end, we also suggest a simplifying assumption that holds true for finite dimensional kernels, and that acts as a sufficient condition for linear convergence to hold in the much harder case of infinite dimensions. When this condition is not satisfied, we provide a weaker guarantee. Our analysis also yields a new distributed algorithm for large-scale computation that we prove converges linearly under the same assumptions. Finally, we provide an empirical evaluation to test the proposed algorithm.

1 Introduction

Estimating expectations of functions is a common problem that is fundamental to many applications in machine learning, e.g., computation of marginals, prediction after marginalization of latent variables, calculating the risk, bounding the generalization error, estimating sufficient statistics, etc. In all these cases, the goal is to approximate tractably the integral

 Ep[f(x)]=∫Xf(x)p(x)dx, (1)

for a given function , with value oracle access, and density , with sampling oracle access. In most cases of interest, the integral is not computable in closed form. A common solution is to sample from the domain of , and then to approximate the full integration using the samples. The simplest version of this is to use vanilla Monte Carlo (MC) Integration—sample uniformly at random from the domain of , and then output the empirical average as the estimate of the integral. This simple algorithm converges at a prohibitively slow rate of . More generally, one can apply Markov Chain Monte Carlo (MCMC) algorithms for (1), and these algorithms come with many benefits, but they also converge quite slowly.

To alleviate the slower convergence of MC-based Integration, non-uniformly at random sampling algorithms such has Kernel Herding (KH) have been proposed. The herding algorithm [26, 27, 25] was proposed to learn Markov Random Fields (MRFs), and it was applicable to discrete finite-dimensional spaces. This was extended to continuous spaces and infinite dimensions using the kernel trick by [5], who also provided a convergence rate of , which is theoretically the same as convergence rate of MC Integration in general. However, in practice, it has been observed that herding is faster. The speed up can be attributed to a smarter selection of sampling points (rather than uniformly at random) that ensures that the selected samples are somewhat negatively correlated. In addition, [5] also provided a moment matching interpretation of the algorithm. Essentially, solving (1) using the weighted average of a few samples () can be interpreted as minimizing a discrepancy in an Reproducing Kernel Hilbert Space (RKHS). The discrepancy is also widely known to be the Maximum Mean Discrepancy (MMD) metric [5, 11].

In an alternative Bayesian approach, [22, 8] assume a Gaussian Process prior on and use a herding algorithm to sample so as to minimize the corresponding moment matching discrepancy arising out of the posterior of , after observing the points already sampled. This perspective was shown to be equivalent to minimizing the herding discrepancy, but with an additional step of also minimizing the weights attached to the sampled points, instead of using uniform weights of  [11]. Empirically, it was observed that this new algorithm—called Sequential Bayesian Quadrature (SBQ)—converges faster than naive herding because of the additional weight optimization step. To the best of our knowledge, the analysis of faster convergence for SBQ was still largely an open problem before our present work.

A partial explanation for faster convergence in Weighted Kernel Herding (WKH) was provided by [2]. While SBQ chooses the next sample point and the weights to minimize the said discrepancy, the herding algorithm only solves a linearized approximation of the same discrepancy. Once, the sample point is selected, the weights themselves can be optimized. We name this algorithm (solving linearized discrepancy + optimizing weights) as WKH. For the case when the weights are constrained to sum up to [2] noted that WKH is equivalent to the classic Frank-Wolfe [7] algorithm on the marginal polytope. Exploiting this connection, they were able to use convergence results from Frank-Wolfe analysis to analyze constrained WKH. Specifically, if the optimum lies in the relative interior of the marginal polytope, so that it is distance away from the boundary, the convergence rate is . For finite dimensional kernels, they show ; while for infinite dimensional kernels, . Even for finite dimensions, can be arbitrarily close to , rendering the exponential convergence meaningless. It was also pointed out by Bach et al. [2] that the existing theory does not fully justify the much better empirical performance of WKH. To the best of our knowledge, this question has remain unresolved before our present work.

In our present work, we provide an analysis for exponential convergence of unconstrained WKH. As noted by Huszar and Duvenaud [11], the weights in WKH do not need to sum to in many applications, and so the correspondence to Frank-Wolfe and its limitations do not hold for us. Furthermore, our analysis also extends to infinite dimensional kernels, provided a simplifying structural assumption is satisfied.

Central to our analysis is what we call a realizability assumption. By this, we assume that the mean embedding in the kernel space can be exactly reproduced by a linear combination of samples , from the domain. We discuss the assumption and its impact in Section 3, and we emphasize that it is not restrictive. In fact, for many nice cases, or , but more generally, realizability guarantees convergence. The comparative view of the algorithms and their respective rates is presented in Table 1.

A side effect of our analysis is that we are able to scale up the herding algorithm to multiple machines, cutting down on run time and memory requirements. As noted by Huszar and Duvenaud [11], the iteration of KH and WKH require an search over the domain to sample the next point, while SBQ requires . Thus, for large , these algorithms can get slower. Furthermore, they can be memory intensive. Our proposed solution is to split the domain and hence the search to multiple machines, and to run a local WKH/SBQ, thus speeding up the search for the next sample point. The local solutions are then collated to output the final set of samples. Interestingly, we show that this distributed algorithm also exhibits linear convergence under slightly stricter assumptions.

Our contributions in this work are: (a) Theoretical—providing the fastest known convergence rates for two popular sampling algorithms; (b) Algorithmic—providing a new distributed herding algorithm that we also show converges linearly; and (c) Empirical—since there is ample empirical evidence supporting the good performance of the KH/SBQ algorithms in earlier works [11, 5, 2], we focus instead on demonstrating the empirical performance of the distributed algorithm. In more detail, our main contributions are:

• We take a renewed look at convergence analysis through the lens of approximation error after iterations that is typical in discrete optimization. This allows us to analyze and prove linear convergence of two widely used algorithms for approximating expectations, under standard assumptions for well-known setups. Our analysis also provides theoretical justifications for empirical observations already made in the literature.

• We propose the novel niceness assumption of realizability in the optimization problem we study. The assumption trivially holds for many known cases, but even for several harder cases, it acts as a sufficient condition to guarantee linear convergence.

• We consider infinite dimensional cases, and we provide a weaker guarantee for cases that do not follow realizability.

• We propose a novel distributed algorithm for approximating expectations for large scale computations, and we study its convergence properties. To the best of our knowledge, herding has not been applied to such a scale before.

• Finally, we present empirical studies to validate the newly proposed distributed algorithm.

Other related works.

The connection to the Frank-Wolfe algorithm was further extended by Briol et al. [3] who provided new variants and rates. The Frank-Wolfe algorithm [13, 18] has other variants than the one considered by Bach et al. [2] to relate to KH. These variants enjoy faster convergence at the cost of additional memory requirement. Specifically, instead of just selecting sample points, one can think of removing bad points from the set already selected. This variant of Frank-Wolfe, known as FW with away steps, is one of the more commonly used ones in practice, because it is known to converge faster. If the weights are not restricted to lie on the simplex, the analogy to matching pursuit algorithms is obvious. We refer to Locatello et al. [21] for corresponding convergence rates. However, the linear rates for these algorithms require bounding certain geometric properties of the constraint set, and this may not be easy to do for RKHS-based applications that usually employ KH and/or SBQ. There have been some works that provide sufficient conditions for fast convergence of SBQ under additional assumptions on sufficient exploration of point sets [14], or special cases of kernels (e.g. the ones generating Sobolev spaces [23]). Our setting is more general than these works. There have been studies that discuss dimension dependent convergence rates [4] for infinitely smooth functions. Such rates take form of for dimension . On the other hand, our result guarantees linear convergence independent of as long as is finite. Finally, even if is infinite, we show a certain form of approximation guarantee holds. More recently, Kanagawa et al. [15] study convergence properties in misspecified settings, which can be an interesting direction of future work stemming out of our work.

With the goal of interpreting blackbox models, Khanna et al. [17] recently exploited connections with submodular optimization to provide the weaker type convergence we discuss (in Section 3.2) in the form of an approximation guarantee for SBQ for discrete . Our result is more general, since we also address the case of discrete . We do not make use of submodular optimization results. A similar proof technique was also used by Khanna et al. [16] for proving approximation guarantees of low rank optimization. The proof idea for the distributed algorithm was inspired from tracking the optimum set, which is a common theme in analysis of distributed algorithms in discrete optimization (see, e.g., [1] and references therein).

2 Background

In this section, we discuss some relevant background for the algorithms and methods at hand. We begin by establishing some notation. We represent vectors as small letter bolds, e.g., . Matrices are represented by capital bolds, e.g.. . Sets are represented by sans serif fonts, e.g., ; and the complement of a set is . A dot product in an RKHS with kernel is represented as , and the corresponding norm is . The dual norm is written as . We denote as .

Maximum mean Discrepancy (MMD).

MMD measures divergence between two distributions and over the space of functions as:

 MMD(p,q):=supf∈F∥∫f(x)p(x)dx−∫f(x)q(x)∥ =∥∬k(x,y)p(x)p(y)−∬k(x,y)q(x)q(y)∥ (2) =∥μp−μq∥k (3)

where (2) holds if is an RKHS, and where and are the respective mean kernel embeddings of and . MMD() , and if is rich enough MMD() iff . We refer to Sriperumbudur et al. [24], Gretton et al. [9] for further details. The sampling algorithms we consider in this work use the MMD to approximate the underlying density using discrete constructed by the algorithms.

Weighted Kernel Herding.

Our goal is to evaluate the expectation of a function under some distribution . In general, such evaluations may not be tractable if we do not have access to the parametric form of the said density and/or function. In this work, we only assume value oracle access to the function and a sampling oracle access to the distribution. Thus, we intend to approximate the expectation of a function by a weighted sum of a few evaluations of the said function. Say a function is defined on a measurable space . Consider the integral:

 Ep[f(x)]=∫Xf(x)p(x)dx≈n∑i=1wif(xi), (4)

where are the weights associated with function evaluations at . The algorithms used to solve (4) have a common underlying theme of two alternating steps: (1) Generate the next sample point ; and (2) Evaluate the weights for all the samples chosen so far. For example, using and sampling uniformly at random over recovers the standard MC integration. Other methods include Kernel Herding (KH [5]) and quasi-Monte carlo [6], both of which use but use specific schemes to draw . Sequential Bayesian Quadrature (SBQ [22]) goes one step further and employs a non-uniform weight vector as well. For convenience, define the set . Say .

We present a brief overview of these algorithms next, and point the reader to Huszar and Duvenaud [11], Chen et al. [5] for further details. As already noted, solving (4) using a discrete density that approximates the underlying density can be equivalently considered to be minimizing the MMD discrepancy 3 if lies in the said RKHS. Kernel Herding (KH) chooses the next sample by solving the linearized form of the objective discrepancy, and it sets all the weights to be . Say is the dot product in the RKHS defined by the kernel function . The linear objective to solve for the next sample point is given by:

 xKHn+1←argminx∈X1n+1n∑i=1wik(x,xi)−2Ex′∼p[k(x,x′)]. (5)

The SBQ algorithm is slightly more involved. It assumes a functional Gaussian Process (GP) prior on with the kernel function . This implies that the approximate evaluation in (4) is a random variable. The sample is then chosen as the one that minimizes the posterior variance while the corresponding weights are calculated from the posterior mean of this variable. The details follow.

Say we have already chosen points: . From standard properties of Gaussian processes, the posterior of , given the evaluations , has the mean function:

 ^f(x)=k⊤K−1f,

where is the vector of function evaluations , is the vector of kernel evaluations , and is the kernel matrix with . The posterior variance can be written as:

 cov(x,y)=k(x,y)−k⊤K−1k.

The posterior over the function also yields a posterior over the expectation over defined in (4). Then, it is straightforward to see , where . Note that the weights in (4) can be written as . We can write the variance of as:

 var(z(Sn))=∬k(x,y)p(x)p(y)dxdy−z⊤K−1z. (6)

The algorithm Sequential Bayesian Quadrature (SBQ) samples for the points in a greedy fashion with the goal of minimizing the posterior variance of the computed approximate integral:

 xSBQn+1←argminx∈Xvar(z(Sn∪{x})). (7)

It has been shown that SBQ also can be seen to be choosing samples so as to minimize the same discrepancy as the one by KH [11]. Instead of solving a linearized objective as done by KH, it chooses the next sample so as to directly minimize the discrepancy.

In this work, we first analyze Weighted Kernel Herding (WKH) (Algorithm 1) which combines the best of the two algorithms of SBQ and KH by using , with updating weights using SBQ’s . The linearized objective was also considered by [2], with the additional constraint that the weights are positive and sum to . They use the connections to the classic Frank-Wolfe algorithm [7] to establish convergence rates under this additional constraint. However, their linear convergence rate is dependent on how far inwards does the optimum lie in the marginal polytope, and it devolves to the sublinear rate when the optimum can be arbitrarily close to the boundary of the marginal polytope. The additional condition on the weights is not required for many practical applications. We must note, however, that in some applications it is imperative that the iterates lie within the marginal polytope [20, 19].

3 Convergence Results

In this section, we establish linear convergence of WKH. The starting point of our analysis is the following re-interpretation of the posterior variance minimization (6) as a variational optimization of Mean Maximum Discrepancy (MMD) [11]. We can re-write (6) as a function of a set of chosen samples as:

 \vspace−4mmg(S):=∥μp−∑i∈Swiϕ(xi)∥2k, (8)

where is the respective feature mapping, i.e., , and where is the mean embedding in the kernel space.

Before presenting our result, we delineate the assumptions we make on the cost function (8). There is an implicit assumption that the function under consideration lies within the chosen RKHS, otherwise there is no guarantee that any algorithm can approximate within error, and the discussion of convergence rates is meaningless. Other assumptions follow.

Assumption 1 (Realizability):

We assume a set of atoms for , such that .

Discussion: Assumption 1 posits there exists a set of samples in the mapped domain whose weighted average exactly evaluates to the expectation so that the discrepancy is . We emphasize that this assumption is not restrictive. For example, for -dimensional kernels, , simply because the weights allow linear combination. However, it turns out we can do a lot better.

Proposition 1.

Say the embedding kernel space is finite dimensional, and the domain and the mapping are continuous. Further, if the mapped support of is

1. convex, then Assumption 1 holds with .

2. non-convex, then Assumption 1 holds with .

The proof of Prop 1 relies on the fact that the expectation over a distribution supported on a set would lie within the convex hull of the set. For ease of exposition, the proof is relegated to the Section A.2.

For the cases not covered by the assumptions made in Prop 1 i.e. discrete , and/or for infinite dimensional , as long as Assumption 1 is satisfied and can be proved to be so through other means, our work shows that it is sufficient to prove linear convergence. To the best of our knowledge, such a sufficient condition for these harder cases has not been established before, and this is one of our contributions of this work.

We assume that the loss function is -restricted strongly convex and -smooth over , with and .

Assumption 2 is a standard assumption of restricted strong convexity (RSC) and smoothness (RSS) that is standard for linear convergence analysis. Note that as , it may seem like the upper bound requires taking a uniform bound over the entire support of , as is typically done in classical optimization. But it might be unreasonable to assume that for general RKHS. However, by virtue of the algorithm’s design, more specifically because of the weights being allowed to be linear combination of the selected atoms, any new atom being added would not lie within the span of already selected atoms. As such, the kernel matrix of the chosen atoms would have its lower eigenvalue bounded away from , which implies that stays greater than as grows unless no more atoms can be added.

Assumption 3 (Standardization):

We assume that the feature mapping is standardized, i.e., , as assumed in previous works [11], for ease of exposition.

We are now ready to present our main result regarding the linear convergence of the discrepancy metric .

Theorem 2.

Under Assumptions 1 through 3, if is the sequence of iterates produced by Algorithm 1, the function converges as .

Proof Outline. The central idea of the proof is to track and bound the selection of a sample at each iteration, compared to the ideal selection that could have provided the optimum solution. For this purpose, the properties of the selection subproblem (5) and the assumptions are used. The detailed proof is presented in the appendix.

Discussion: Theorem 2 provides a linear convergence rate for WKH  under the sufficient conditions specified in Assumptions 1 through 3. [2] also provided a linear rate for finite dimensional kernels by drawing on the equivalence of the herding algorithm to the Frank-Wolfe algorithm. Their rate is , where is the distance of the optimum from the boundary of the marginal polytope and is the width of the marginal polytope. Our result is independent of these constants; and by Proposition 1, is better than their rates, as long as for finite dimensional kernels. Since can be arbitrarily close to , our results are much stronger and more general in the sense that they are independent of where the optimum lies. Furthermore, our results will hold for infinite dimensional kernels, as long as the above mentioned sufficient conditions are satisfied.

3.1 Relationship with SBQ

The SBQ Algorithm [8, 11] is similar to the WKH algorithm, except that it solves a slightly more involved optimization subproblem to select the next sample point. The algorithm pseudocode is exactly the one presented in Algorithm 1, with step 4 replaced with (7) instead of (5). Thus, WKH solves a linear program every iteration, while SBQ solves a kernel minimization problem. In this and other theoretical work, it is assumed that the oracle that solves the optimization subproblem is given and the convergence rates generally consider the number of calls made to this oracle for desired accuracy of the overall solution. In applications, however, the subproblems can be vastly impactful on convergence, run times, and memory management. It has been observed that SBQ converges faster in practice, which is intuitive since it performs more work per iteration. Furthermore, for distributed algorithm variants like the one presented in Section 4, the cost per iteration may not be a bottleneck due to availability of multiple machines. On the other hand, in addition to higher cost per iteration, the application of SBQ might not be always feasible. We present one such example in Section 5, where SBQ requires memorization of the kernel matrix for a reasonable run time.

While the two algorithms have their specific advantages, the decrease in the cost function (8) per iteration of SBQ is more compared to its decrease per iteration of WKH. This relationship, combined with Theorem 2 establishes linear convergence of SBQ, which to the best of our knowledge, was still an open question (see Table 1 in [11] which lists the convergence rate of SBQ as unknown).

Corollary 3.

Under Assumptions 1 through 3, if is the sequence of iterates produced by SBQ, the function converges as .

3.2 Breaking realizability

While realizability holds with for many nice cases, it is important to discuss cases when realizability breaks. We first consider the case of discrete . Say is the size of the support of discrete , and is the dimensionality of the kernel space. If , from the fact that each iteration only adds an atom that is in the orthogonal complement of previously added samples, . In this case, the same rate will still hold. On the other hand, it is possible that the mean is realizable for no less than samples if . In this case, the linear rate is no longer valid. One can call upon the Frank-Wolfe connection [2], since as long as , we know that the a linear rate holds (because the optimum lies in the relative interior away from the boundary, since the effective rate is minimum of the two). However, as discussed before, the optimum may lie arbitrarily close to the boundary. In such cases, the following statement still holds nonetheless.

Theorem 4.

(Approximation guarantee) Say Assumptions 2 and 3 are satisfied. Let . Say iterations of Algorithms 1 (WKH or SBQ) select the set . Then, .

The proof is presented in Section A.4. Theorem 4 is weaker than a convergence result but it also makes weaker assumptions since it does not need Assumption 1. Instead of providing a rate on how close we are to the optimum after iterations, it provides a contraction factor on how close the algorithm gets to the best possible steps that any algorithm could have taken. This recovers the special case of SBQ for discrete densities studied by Khanna et al. [17] by exploiting connections to weak submodularity. Note that our result is also valid for WKH and provides an alternative proof for SBQ without exploiting weak submodularity.

Theorem 4 is valid for infinite dimensional kernel spaces as well. It is possible that in infinite dimensions, is not realizable for any . We characterize this as a pathological case, since it implies that is not in the span of any finite number of atoms . One can, however, provide the following weaker guarantee. For any projection of onto a finite number of atoms , we need to run atmost iterations of Algorithms 1 (WKH or SBQ), in order to get close to the objective value of the best such -sized projection.

4 Distributed Kernel Herding

In many cases, the search over the domain can be a bottleneck causing the herding algorithm to be too slow to be useful in practice. In this section, we develop a new herding algorithm that can be distributed over multiple machines and run in a streaming map-reduce fashion. We also provide convergence analysis for the same, using techniques developed in Section 3.

The algorithm proceeds as follows. The domain is split onto machines uniformly at random. Each of the machines has access to only , such that and for . Each machine runs its own herding algorithm (Algorithm 1) independently, by specializing (5) to do a restricted search over , instead of . Finally, the iterates being generated by each machine are sent to a central server machine which collates the samples from different machines by running another copy of the same algorithm, with replaced by the discrete set of samples sent to the central server by the other machines. Finally, the best solution out of all the solutions is returned. The pseudo-code is illustrated in Algorithm 2.We now provide the convergence guarantees for this algorithm.

Theorem 5.

Under Assumption 1, with , and Assumptions 2 and 3, if is the sequence of iterates produced by Algorithm 2, the function converges as .

Proof Idea: Note that the final set of filtered iterates outputted are the best out of possible sequences. The proof tracks the possibilities for when is split. The goal is to then show that under all possible scenarios, at least one of the sequences converges linearly. The convergence of individual sequences is based on the proof techniques used in proof of Theorem 2.

We remark that, for the more general case of , our proof technique does not give a valuable convergence rate. In particular, our analysis yields , which is not very meaningful. This could be an artifact of our proof technique, and improving it is an interesting open question for future research. Also, while we presented the analysis for Distributed WKH, by Corollary 3, the same rate also holds for Distributed SBQ.

5 Experiments

We refer the readers to earlier works for empirical evidence of the relative performance of the Herding and SBQ algorithms [2, 5, 25, 26, 27, 11, 8]. In this section, we focus on studying the distributed versions of these algorithms to illustrate the speedup vs the accuracy tradeoff.

5.1 Matching a distribution

In this study, our goal is to show the tradeoff between performance vs scalability when WKH/SBQ are distributed over multiple machines. Towards this end, we extend an experiment considered by Chen et al. [5], Huszar and Duvenaud [11]. In this experiment, the target density is a mixture of two dimensional Gaussian distributions. Samples are chosen by the contending algorithms, and the MMD distance of the sample distribution to the target distribution is reported for different number of samples.

The sampling subroutine (step 4 in Algorithm 1), requires solving an optimization problem over a continuous domain. To make the problem easier, Chen et al. [5], Huszar and Duvenaud [11] first select 10000 points uniformly at random as the set of potential candidates. They note that this does not affect the performance of chosen samples by much. We also adopt the same methodology with the additional step of arbitrarily partitioning these 10000 points over machines for to run Algorithm 2 with the goal of studying the degradation of performance due to the partitioning.

The results are reported in Figure 1. SBQ-5 and WKH-5 are distributed versions of SBQ and WKH respectively run over machines. The labels SBQ and WKH are for their respective single machine versions. Since the search space is split and the search step is parallelized, we get about 5X speedup in the algorithms, with a graceful degradation in the reported MMD. We noticed a similar pattern of degradation for larger number of machines but is omitted from the graph to avoid overcrowding.

5.2 Data Summarization

The task of data summarization is as follows. Our goal is to select a few data samples that represent the data distribution sufficiently well, so that a model built on the selected subsample of the training data does not degrade too much in performance on the unseen test data. More specifically, we are interested in approximating the test distribution (i.e., discrete ) using a few samples from the training set. Hence, algorithms such as SBQ and WKH are applicable, provided we have a reasonable kernel function. We use Fisher kernels [12], since they were recently used to show strong performance for this task [17]. For completeness, we provide a brief overview of the Fisher kernel in the appendix.

Another method that also aims to do training data summarization is that of coreset selection [10], albeit with a different goal of reducing the training data size for optimization speedup while still maintaining guaranteed approximation to the training likelihood. Since the goal itself is optimization speedup, coreset selection algorithms typically employ fast methods, while still trying to capture the data distribution by proxy of the training likelihood. Moreover, the coreset selection algorithm is usually closely tied with the respective model, as opposed to being a model-agnostic.

We employ different variants of WKH/SBQ to the problem of training data summarization under logistic regression, as considered by Huggins et al. [10] using coreset construction. We experiment using three datasets ChemReact, CovType and WebSpam. ChemReact consists of chemicals each of feature size . Out of these, are test data points. The prediction variable is and signifies if a chemical is reactive. CovType has examples each of feature size . Out of these, are test points. The task is to predict whether a type of tree is present in each location or not. WebSpam has 350,000 webpages each having features. Out of these, are test data points. The task here is to predict whether a webpage is spam or not. We refer to Huggins et al. [10] for source of the datasets.

In each of the datasets, we further randomly split the training data into validation and training. We train the logistic regression model on the new training data, and we use the validation set as a proxy to the unseen test set distribution. We build the kernel matrix and the affinity vector , and we run different variants of sampling algorithms to choose samples from the training set to approximate the discrete validation set distribution in the Fisher kernel space. Once the training set samples are extracted, we rebuild the logistic regression model only on the selected samples, and we report negative test likelihood on unseen test data to show how well has the respective algorithm built a model specific dataset summary.

The algorithms we run are WKH, SBQ, -SBQ and -WKH, where represents the number of machines used to select samples for different values of . The smaller ChemReact data fitted on a single machine, so we run WKH and SBQ on single machine. To present the tradeoff, we also run -WKH and -SBQ. These are about times faster than their single machine counterparts, but they degrade in predictive performance. For the baselines, we use the coreset selection algorithm and random data selection as implemented by Huggins et al. [10]. The results are presented in Figure 2. We note that our algorithm yields a significantly better predictive performance compared to random subsets and coresets with the same size of the training subset across different subset sizes. We note that generally SBQ has better performance numbers than WKH for same across different values of . Note that WebSpam and CovType were too big to run on a single machine, and they are thus perfect examples to illustrate the impact and usefulness of the distributed algorithm. The m-WKH-1 experiment presented in Figure 2 is obtained by averaging the output of a single split of the dataset run on one of the machines (as was done by Khanna et al. [17] to scale up sine they do not have the distributed algorithm to work with). All the experiments were run on 12-core 16Gb RAM machines.

Conclusion:

We have presented novel analysis for two commonly used algorithms as well as for a new distributed algorithm for estimating means. Our results bridge the gap between theory and empirical evidence by presenting the fastest known convergence rates for these algorithms. Our realizability assumption is the key insight that allows us to improve upon previous results and also provide the new distributed algorithm. There are quite a few interesting future research directions. An important one is studying convergence for the distributed algorithm for the case with realizability . Can the realizability help in proposing other faster variants? Alternatively, what are the best possible rates for these problems? .

Appendix A Appendix

a.1 Fisher Kernels

Say we have a parametric model that we learn using maximum likelihood estimation, i.e., , where represents the model parameters and represents the data. The notion of similarity that Fisher kernels employ is that if two objects are structurally similar as the model sees them, then slight perturbations in the neighborhood of the fitted parameters would impact the fit of the two objects similarly. In other words, the feature embedding , for an object can be interpreted as a feature mapping which can then be used to define a similarity kernel by a weighted dot product:

 κ(Xi,Xj):=f⊤iI−1fj,

where the matrix is the Fisher information matrix. The information matrix serves to re-scale the dot product, and it is often taken as identity as it loses significance in limit [12]. The corresponding kernel is then called the practical Fisher kernel, and it is often used in practice.

a.2 Proof of Proposition 1

The expectation over any distribution on a set would lie within the convex hull of the set. Thus if the set is convex, by continuity properties, there exists some atom where the optimum is reached, and hence .

However, if the set is non-convex, the expectation may lie outside the set, while it will still be within the convex hull of the set. In this case, one can always draw a line or chord through the point of expectation and through the set passing through atleast two points within the set to touch the boundary of the convex hull at its two ends. Since such a chord can always be drawn, the expected value can be represented as linear combination of the two points within the set, since the three points are colinear. So, . See Figure 3.

a.3 Proof of Theorem 2

For ease of exposition, instead of directly working with , we translate the function to remove any constants not dependent on the variable. We write,

 l(S):=∥μp∥2k−g(S)=z⊤K−1z.

Some auxiliary Lemmas are proved later in this section. We use Further, note that the Assumption 2, when applied for , ensures that for any iterates considered in this proof we have that

 −mω2∥Z(S1)−Z(S2)∥2k≥ l(S1)−l(S2)−⟨∇l(S2),Z(S1)−Z(S2)⟩k ≥−MΩ2∥Z(S1)−Z(S2)∥2k.
Proof.

Say steps of the Algorithm 1 have been performed to select the set . Let be the corresponding weight vector. At the step, is sampled as per (5). Let , so that (as per Lemma 6) . Set weight vector as follows. For , . Set , where is an arbitrary scalar.

From weight optimality proved in Lemma 6,

 l(S∪{xi})−l(S)≥h(S∪{xi},u)−l(S),

for an arbitrary . From Assumption 2 (smoothness),

 l(S∪{xi})−l(S)≥α⟨∇l(S),ϕ(xi)⟩k−α2MΩ2.

Let be the optimum value of the solution of (5). Since is the optimizing atom,

 l(S∪{xi})−l(S)≥αγS−α2MΩ2.

Let be the set obtained by orthogonalizing with respect to using the Gram-Schmidt procedure. Putting in , we get,

 l(S∪{xi})−l(S) ≥12MΩγS (9) ≥12rMΩ∑xj∈S⋆⊥⟨ϕ(xj),∇l(S)⟩2k ≥mωrMΩ(l(S∪S⋆⊥)−l(S)) (10) ≥mωrMΩ(l(S⋆r)−l(S)) =mωrMΩ(∥μp∥2k−l(S)).

The second inequality is true because is the value of (5) in the iteration. The third inequality follows from Lemma 7. The fourth inequality is true because of monotonicity of , and the final equality is true because of Assumption 1 (realizability).

Let . We have . The result now follows.

a.4 Proof of Theorem 4

Proof.

We proceed as in the proof of Theorem 2, but by replacing with . From 10,

 l(S∪{xi})−l(S) ≥mωrMΩ(l(Tr)−l(S)).

Adding and subtracting on the LHS and rearranging,

 l(Tr)−l(S∪{xi}) ≤(1−mωrMΩ)(l(Tr)−l(S)).

Thus after iterations,

 l(Tr)−l(Sk) ≤(1−mωrMΩ)k(l(Tr)−l(∅)).

Rearranging,

 l(Sk) ≥(1−(1−mωrMΩ)k)l(Tr) ≥(1−exp(−kmωrMΩ))l(Tr).

With , we get,

 l(Sk)≥(1−ϵ)l(Tr).

The result now follows. ∎

a.5 Auxiliary Lemmas

The following Lemma proves that the weights in obtained using the posterior inference are an optimum choice that minimize the distance to in the RKHS over any set of weights [17].

Lemma 6.

The residual is orthogonal to . In other words, for any set of samples , .

Proof.

Recall that , and . For an arbitrary index ,

 ⟨μp−∑jwjϕ(xj),ϕ(xi)⟩k = ∫k(x,xi)p(x)d(x)−⟨∑jwjϕ(xj),ϕ(xi)⟩k = zi−⟨∑jwjϕ(xj),ϕ(xi)⟩k = zi−∑jwjk(xj,xi) = = zi−∑tzt∑jKji[K−1]tj = zi−zi,

where the last equality follows by noting that is inner product of row of and row of , which is if and otherwise. This completes the proof. ∎

Lemma 7.

For any set of chosen samples , , let be the operator of projection onto span(). Then, .

Proof.

Observe that

 0 ≤l(S1∪S2)−l(S1) ≤⟨∇l(S1),Z(S1∪S2)−Z(S1)⟩k −mω2∥Z(S1∪S2)−Z(S1)∥2k ≤argmaxX∈span(S1∪S2)⟨∇l(S1),X−Z(S1)⟩k−mω2∥X−Z(S1)∥2k =argmaxX⟨P(∇l(S1)),X−Z(S1)⟩k−mω2∥X−Z(S1)∥2k.

Solving the argmax problem on the RHS for , we get the required result.

Appendix B Proof of Theorem 5

We next present some notation and few lemmas that lead up to the main result of this section (Theorem 5). Let be the -sized solution returned by running Algorithm 1 on . Note that each induces a partition onto the optimal -sized solution as follows ( for this theorem):

 Tj:={x∈S⋆1:x∉% WKH(Xj∪x,k)}, Tcj:={x∈S⋆1:x∈%WKH(Xj∪x,k)}.

In other words, if the and the machine will not select it as among its iterates, and it is empty otherwise, since is a singleton. We re-use the definition of used in Section A.3.

Before moving to the proof of the main theorem, we prove two prerequisites. Recall is the set of iterates selected by machine .

Lemma 8.

For any individual worker machine running local WKH, if is the set of iterates already chosen, then at the selection of next sample point , .

Proof.

We proceed as in proof of Theorem 2 in Section A.3. From (9), we have,

 l(S∪{x})−l(S) ≥12MΩγS ≥12MΩ∑xj∈Tj⟨ϕ(xj),∇l(S)⟩2k ≥mωMΩ(l(S∪Tj)−l(S)) ≥mωMΩ(l(Tj)−l(S)).

Lemma 9.

For the aggregator machine that runs WKH over (step 6 of Algorithm 2), we have, at selection of next sample point after having selected , machine such that

 E[l(S∪{xi})−l(S)]≥mωMΩE(l(Tcj)−l(S)).
Proof.

The expectation is over the random split of into for . We denote to be the complement of . Then, we have that

The equality in step 3 above is because of Lemma 10. ∎

We are now ready to prove Theorem 5.

Proof of Theorem 5.

If, for a random split of , for any , , then the given rate follows from Lemma 8, after following the straightforward steps covered in proof of Theorem 2 for proving the rate from the given condition on . On the other hand, if none of , , then . In this case, the given rate follows from Lemma 9. ∎

Finally, here is the statement and proof of an auxiliary lemma that was used above.

For any .

Proof.

We have

 P(x∈∪jGj) =∑jP(x∈Ai∩x∈WKH% (Ai,k)) =∑jP(x∈Ai)P(x∈WKH(Ai,k)|x∈Ai) =1lP(x∈Si).

b.1 Proof of Corollary 3

Proof.

The proof of Theorem 2 follows the thread of showing that in every iteration WKH makes atleast a constant factor approx progress wrt the best possible progress that could have been made (See (9) in the supplement). SBQ converges at least as fast as WKH, since SBQ makes faster progress per iteration. To see this, consider a single iteration of each from a given set of iterates . Let and be the corresponding sample selected through an iteration of SBQ and WKH respectively. Then, with , from (7),

 g(S∪xSBQ)=minw∈Ri,x∈X∥μ(i−1)−wiϕ(x)∥2k.≤g(S∪xWKH)

References

1. J. Altschuler, A. Bhaskara, G. Fu, V. S. Mirrokni, A. Rostamizadeh and M. Zadimoghaddam (2016) Greedy column subset selection: new bounds and distributed algorithms. ICML. Cited by: §1.
2. F. Bach, S. Lacoste-Julien and G. Obozinski (2012) On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, pp. 1355–1362. Cited by: §1, Table 1, §1, §1, §2, §3, §3.2, §5.
3. F. Briol, C. Oates, M. Girolami and M. A. Osborne (2015) Frank-wolfe bayesian quadrature: probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett (Eds.), pp. 1162–1170. Cited by: §1.
4. F. Briol, C. J. Oates, M. Girolami, M. A. Osborne and D. Sejdinovic (2019-02) Probabilistic integration: a role in statistical computation?. Statistical Science 34 (1), pp. 1–22. External Links: ISSN 0883-4237, Document Cited by: §1.
5. Y. Chen, M. Welling and A. J. Smola (2010) Super-samples from kernel herding. In UAI, Cited by: Table 1, §1, §1, §2, §2, §5.1, §5.1, §5.
6. J. Dick and F. Pillichshammer (2010) Digital nets and sequences: discrepancy theory and quasi-monte carlo integration. Cambridge University Press, New York, NY, USA. Cited by: §2.
7. M. Frank and P. Wolfe (1956) An algorithm for quadratic programming. Naval research logistics quarterly. Cited by: §1, §2.
8. Z. Ghahramani and C. E. Rasmussen (2003) Bayesian monte carlo. In Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun and K. Obermayer (Eds.), pp. 505–512. External Links: Link Cited by: §1, §3.1, §5.
9. A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf and A. J. Smola (2007) A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems 19, B. Schölkopf, J. C. Platt and T. Hoffman (Eds.), pp. 513–520. Cited by: §2.
10. J. H. Huggins, T. Campbell and T. Broderick (2016) Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 4080–4088. Cited by: Figure 2, §5.2, §5.2, §5.2.
11. F. Huszar and D. K. Duvenaud (2012) Optimally-weighted herding is Bayesian quadrature. In UAI, Cited by: §1, §1, §1, §1, §1, §2, §2, §3, §3.1, §3.1, §3, §5.1, §5.1, §5.
12. T. S. Jaakkola and D. Haussler (1999) Exploiting generative models in discriminative classifiers. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, Cambridge, MA, USA, pp. 487–493. Cited by: §A.1, §5.2.
13. M. Jaggi (2013) Revisiting Frank-Wolfe: projection-free sparse convex optimization. In ICML, Vol. 28, pp. 427–435. Cited by: §1.
14. M. Kanagawa and P. Hennig (2019) Convergence guarantees for adaptive bayesian quadrature methods. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. Alche-Buc, E. Fox and R. Garnett (Eds.), pp. 6234–6245. Cited by: §1.
15. M. Kanagawa, B. K. Sriperumbudur and K. Fukumizu (2020-02) Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings. Foundations of Computational Mathematics 20 (1), pp. 155–194. Cited by: §1.
16. R. Khanna, E. Elenberg, A. G. Dimakis, J. Ghosh and Negahban,Sahand (2017) On approximation guarantees for greedy low rank optimization. In ICML, Cited by: §1.
17. R. Khanna, B. Kim, J. Ghosh and O. Koyejo (2019) Interpreting black box predictions using Fisher kernels. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. Cited by: §A.5, §1, §3.2, §5.2, §5.2.
18. S. Lacoste-Julien and M. Jaggi (2015) On the Global Linear Convergence of Frank-Wolfe Optimization Variants. In NIPS 2015, pp. 496–504. Cited by: §1.
19. F. Locatello, G. Dresdner, R. Khanna, I. Valera and G. Raetsch (2018) Boosting black box variational inference. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 3401–3411. Cited by: §2.
20. F. Locatello, R. Khanna, J. Ghosh and G. Raetsch (2018) Boosting variational inference: an optimization perspective. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. Cited by: §2.
21. F. Locatello, R. Khanna, M. Tschannen and M. Jaggi (2017) A unified optimization view on generalized matching pursuit and Frank-Wolfe. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §1.
22. A. O’Hagan (1991-11) Bayes-Hermite quadrature. Journal of Statistical Planning and Inference 29 (3), pp. 245 – 260. Cited by: §1, §2.
23. G. Santin and B. Haasdonk (2016) Convergence rate of the data-independent -greedy algorithm in kernel-based approximation. External Links: 1612.02672 Cited by: §1.
24. B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf and G. R.G. Lanckriet (2010-08) Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. 11, pp. 1517–1561. External Links: ISSN 1532-4435 Cited by: §2.
25. M. Welling and Y. Chen (2010-06) Statistical inference using weak chaos and infinite memory. Journal of Physics: Conference Series 233, pp. 012005. Cited by: §1, §5.
26. M. Welling (2009) Herding dynamic weights for partially observed random field models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pp. 599–606. Cited by: §1, §5.
27. M. Welling (2009) Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA, pp. 1121–1128. Cited by: §1, §5.