# On Linear Convergence of Weighted Kernel Herding

Rajiv Khanna, Department of Statistics, University of California at Berkeley. Email: rajivak@berkeley.edu

Michael W. Mahoney, ICSI and Department of Statistics, University of California at Berkeley. Email: mmahoney@stat.berkeley.edu

###### Abstract

We provide a novel convergence analysis of two popular sampling algorithms, Weighted Kernel Herding and Sequential Bayesian Quadrature, that are used to approximate the expectation of a function under a distribution. Existing theoretical analysis was insufficient to explain the empirical successes of these algorithms. We improve upon existing convergence rates to show that, under mild assumptions, these algorithms converge linearly. To this end, we also suggest a simplifying assumption that is true for most cases in finite dimensions, and that acts as a sufficient condition for linear convergence to hold in the much harder case of infinite dimensions. When this condition is not satisfied, we provide a weaker convergence guarantee. Our analysis also yields a new distributed algorithm for large-scale computation that we prove converges linearly under the same assumptions. Finally, we provide an empirical evaluation to test the proposed algorithm for a real world application.

## 1 Introduction

Estimating expectations of functions is a common problem that is fundamental to many applications in machine learning, e.g., computation of marginals, prediction after marginalization of latent variables, calculating the risk, bounding the generalization error, estimating sufficient statistics, etc. In all these cases, the goal is to approximate tractably the integral

$$\mathbb{E}_p[f(x)] = \int_{\mathcal{X}} f(x)\, p(x)\, dx, \tag{1}$$

for a given function $f$, with value oracle access, and density $p$, with sampling oracle access. In most cases of interest, the integral is not computable in closed form. A common solution is to sample from the domain of $f$, and then to approximate the full integration using the samples. The simplest version of this is to use vanilla Monte Carlo (MC) Integration: sample uniformly at random from the domain of $f$, and then output the empirical average as the estimate of the integral. This simple algorithm converges at a prohibitively slow rate of $O(1/\sqrt{n})$ for $n$ samples. More generally, one can apply Markov Chain Monte Carlo (MCMC) algorithms for (1), and these algorithms come with many benefits, but they also converge quite slowly.
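As a minimal illustration of the MC baseline, the sketch below estimates an expectation by an empirical average and prints the error for growing sample sizes. The integrand, the standard normal density, and the helper name `mc_estimate` are assumptions chosen for illustration, and samples are drawn from $p$ through a sampling oracle rather than uniformly over the domain.

```python
import numpy as np

def mc_estimate(f, sampler, n, rng):
    """Plain Monte Carlo estimate of E_p[f(x)] from n draws of the sampling oracle."""
    x = sampler(n, rng)
    return float(np.mean(f(x)))

# Illustrative setup (an assumption): f(x) = x^2 under p = N(0, 1), so E_p[f] = 1.
f = lambda x: x ** 2
sampler = lambda n, rng: rng.standard_normal(n)

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    est = mc_estimate(f, sampler, n, rng)
    # The absolute error typically shrinks roughly like 1 / sqrt(n).
    print(n, est, abs(est - 1.0))
```

Increasing the sample size by a factor of 100 buys roughly one extra digit of accuracy, which is the slow rate that herding-type methods aim to improve on.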

To alleviate the slower convergence of MC-based Integration, non-uniform sampling algorithms such as Kernel Herding (KH) have been proposed. The initial herding algorithm [18, 19, 20] was proposed to learn Markov Random Fields (MRFs), and it was applicable to discrete finite-dimensional spaces. This was extended to continuous spaces and infinite dimensions using the kernel trick by [3], who also provided a convergence rate of $O(1/n)$ for $n$ samples, which is faster than the convergence rate of MC Integration. The speedup can be attributed to a smarter selection of sampling points (rather than uniformly at random) that ensures that the selected samples are somewhat negatively correlated. In addition, [3] also provided a moment matching interpretation of the algorithm. Essentially, solving (1) using the weighted average of a few samples can be interpreted as minimizing the discrepancy in a Reproducing Kernel Hilbert Space (RKHS). The discrepancy is also widely known as the Maximum Mean Discrepancy (MMD) metric [3, 8].

In an alternative Bayesian approach, [15, 6] assume a Gaussian Process prior on $f$ and use a herding algorithm to sample so as to minimize the corresponding moment matching discrepancy arising out of the posterior of $f$, after observing the points already sampled. This perspective was shown to be equivalent to minimizing the herding discrepancy, but with an additional step of also optimizing the weights attached to the sampled points, instead of using uniform weights of $1/n$ [8]. Empirically, it was observed that this new algorithm, called Sequential Bayesian Quadrature (SBQ), converges faster than naive herding because of the additional weight optimization step. To the best of our knowledge, the analysis of faster convergence for SBQ was still largely an open problem before our present work.

A partial explanation for faster convergence in Weighted Kernel Herding (WKH) was provided by [2]. While SBQ chooses the next sample point and the weights to minimize the said discrepancy, the herding algorithm only solves a linearized approximation of the same discrepancy. Once the sample point is selected, the weights themselves can be optimized. We name this algorithm (solving linearized discrepancy + optimizing weights) WKH. For the case when the weights are constrained to sum up to $1$, [2] noted that WKH is equivalent to the classic Frank-Wolfe [5] algorithm on the marginal polytope. Exploiting this connection, they were able to use convergence results from Frank-Wolfe analysis to analyze constrained WKH. Specifically, if the optimum lies in the relative interior of the marginal polytope, so that it is a positive distance away from the boundary, the convergence rate is exponential, with a constant that depends on this distance. For finite dimensional kernels, they show that the distance is strictly positive; while for infinite dimensional kernels, it is zero. Even for finite dimensions, the distance can be arbitrarily close to zero, rendering the exponential convergence meaningless. It was also pointed out by [2] that the existing theory does not fully justify the much better empirical performance of WKH. To the best of our knowledge, this question has remained unresolved before our present work.

In our present work, we provide an analysis for exponential convergence of unconstrained WKH. As noted by [8], the weights in WKH do not need to sum to $1$ in many applications, and so the correspondence to Frank-Wolfe and its limitations do not hold for us. Furthermore, our analysis also extends to non-convex support of $p$ and infinite dimensional kernels, provided a simplifying structural assumption is satisfied.

Central to our analysis is what we call a realizability assumption. We assume that the mean in (1) can be exactly reproduced by a linear combination of a finite number of samples from the domain. We discuss the assumption and its impact in Section 3, and we emphasize that it is not restrictive. In fact, for many nice cases, a single sample suffices, but more generally, realizability guarantees convergence. The comparative view of the algorithms and their respective rates is presented in Table 1.

A side effect of our analysis is that we are able to scale up the herding algorithm to multiple machines, cutting down on run time and memory requirements. As noted by [8], the iteration of KH and WKH require an search over the domain to sample the next point, while SBQ takes . Thus, for larger , these algorithms can get slower. Our proposed solution is to split the domain and hence the search to multiple machines, and to run a local WKH/SBQ, thus speeding up the search for the next sample point. The local solutions are then collated to output the final set of samples. Interestingly, we show that this distributed algorithm also exhibits linear convergence.

Our contributions in this work are: (a) Theoretical—providing the fastest known convergence rates for two popular sampling algorithms; (b) Algorithmic—providing a new distributed herding algorithm that we also show converges linearly; and (c) Empirical—since there is ample empirical evidence supporting the good performance of the KH/SBQ algorithms in earlier works [8, 3, 2], we focus on and demonstrate the empirical performance of the distributed algorithm in a real world application. In more detail, our main contributions are:

• We analyze and prove linear convergence of two widely used algorithms for approximating expectations, under standard assumptions for well-known setups. Our analysis also provides theoretical justifications for empirical observations already made in the literature.

• We propose the novel niceness assumption of realizability in the optimization problem we study. The assumption trivially holds for many known cases, but even for several harder cases, it acts as a sufficient condition to guarantee linear convergence.

• We consider pathological infinite dimensional cases, and we provide a weaker convergence guarantee for cases that do not follow realizability.

• We propose a novel distributed algorithm for approximating expectations for large scale computations, and we study its convergence properties.

• Finally, we present empirical studies to validate the newly proposed algorithm.

#### Other related works.

The Frank-Wolfe algorithm [10, 11] has variants other than the one considered by [2] to relate to KH. These variants enjoy faster convergence at the cost of additional memory requirements. Specifically, instead of just selecting sample points, one can think of removing bad points from the set already selected. This variant of Frank-Wolfe, known as FW with away steps, is one of the more commonly used ones in practice, because it is known to converge faster. If the weights are not restricted to lie on the simplex, the analogy to matching pursuit algorithms is obvious. We refer to [14] for corresponding convergence rates. However, the linear rates for these algorithms require bounding certain geometric properties of the constraint set, and this may not be easy to do for RKHS-based applications that usually employ KH and/or SBQ. With the goal of interpreting blackbox models, [16] recently exploited connections with submodular optimization to provide the weaker type of convergence we discuss (in Section 3.2) in the form of an approximation guarantee for SBQ for discrete $p$. Our result is more general, since we also address the case of discrete $p$ (in Section 3.2). We do not make use of submodular optimization results, but instead we directly employ simple linear algebra after the key insight of realizability. A similar proof technique was also used by [17] for proving approximation guarantees of low rank optimization. The proof idea for the distributed algorithm was inspired by tracking the optimum set, which is a common theme in analysis of distributed algorithms (see, e.g., [1] and references therein).

#### Outline.

The rest of the paper is organized as follows. We present relevant background on the algorithms in Section 2. The main theoretical results are presented in Section 3, although most of the proofs are relegated to the appendix; and the algorithm for distributed kernel herding is presented in Section 4. Finally, we provide empirical results in Section 5; and we provide a brief conclusion in Section 6.

## 2 Background

In this section, we discuss some relevant background for the algorithms and methods at hand. We begin by establishing some notation. We represent vectors as small letter bolds, e.g., $\mathbf{z}$. Matrices are represented by capital bolds, e.g., $\mathbf{K}$. Matrix transposes are represented by superscript $\top$. We use $p(\cdot)$ to represent probability densities over random variables, which may be scalar, vector, or matrix valued, which shall be clear from context. Sets are represented by sans serif fonts, e.g., $\mathsf{S}$; and the complement of a set is . A dot product in an RKHS with kernel $k$ is represented as $\langle \cdot, \cdot \rangle_k$, and the corresponding norm is $\|\cdot\|_k$. The dual norm is written as . We denote as .

#### Weighted Kernel Herding.

Our goal is to evaluate the expectation of a function $f$ under some distribution $p$. In general, such evaluations may not be tractable if we do not have access to the parametric form of the said density and/or function. In this work, we only assume value oracle access to the function and sampling oracle access to the distribution. Thus, we intend to approximate the expectation of a function by a weighted sum of a few evaluations of the said function. Say a function $f$ is defined on a measurable space $\mathcal{X}$. Consider the integral:

$$\mathbb{E}_p[f(x)] = \int_{\mathcal{X}} f(x)\, p(x)\, dx \approx \sum_{i=1}^{n} w_i f(x_i), \tag{2}$$

where $w_i$ are the weights associated with function evaluations at $x_i$. The algorithms used to solve (2) have a common underlying theme of two alternating steps: (1) Generate the next sample point $x_{n+1}$; and (2) Evaluate the weights for all the samples chosen so far. For example, using $w_i = 1/n$ and sampling uniformly at random over $\mathcal{X}$ recovers the standard MC integration. Other methods include Kernel Herding (KH [3]) and quasi-Monte Carlo [4], both of which use $w_i = 1/n$ but use specific schemes to draw $x_i$. Sequential Bayesian Quadrature (SBQ [15]) goes one step further and employs a non-uniform weight vector as well. For convenience, define the set $\mathsf{S}_n := \{x_1, \ldots, x_n\}$ of samples chosen so far.

We present a brief overview of these algorithms next, and point the reader to [8, 3] for further details. As already noted, solving (2) can be equivalently considered to be minimizing a discrepancy in an RKHS. Kernel Herding (KH) chooses the next sample by solving the linearized form of the objective discrepancy, and it sets all the weights to be $1/n$. Say $\langle \cdot, \cdot \rangle_k$ is the dot product in the RKHS defined by the kernel function $k(\cdot, \cdot)$. If $Z(\mathsf{S}_n)$ is the current estimate of the integral (2), then the linear objective to solve for the next sample point is given by:

$$x^{\mathrm{KH}}_{n+1} \leftarrow \arg\min_{x \in \mathcal{X}} \langle Z(\mathsf{S}_n), x \rangle_k = \arg\min_{x \in \mathcal{X}} \sum_{i=1}^{n} w_i\, k(x, x_i). \tag{3}$$

The SBQ algorithm is more complicated. It assumes a functional Gaussian Process (GP) prior on $f$ with the kernel function $k$. This implies that the approximate evaluation in (2) is a random variable. The sample is then chosen as the one that minimizes the posterior variance while the corresponding weights are calculated from the posterior mean of this variable. The details follow.

Say we have already chosen $n$ points: $x_1, \ldots, x_n$. From standard properties of Gaussian processes, the posterior of $f$, given the evaluations $f(x_1), \ldots, f(x_n)$, has the mean function:

$$\hat{f}(x) = \mathbf{k}^\top \mathbf{K}^{-1} \mathbf{f},$$

where $\mathbf{f}$ is the vector of function evaluations $f(x_i)$, $\mathbf{k}$ is the vector of kernel evaluations $k(x, x_i)$, and $\mathbf{K}$ is the kernel matrix with $\mathbf{K}_{ij} = k(x_i, x_j)$. The quadrature estimate provides not only the mean, but the full distribution as its posterior. The posterior covariance can be written as:

$$\mathrm{cov}(x, y) = k(x, y) - k(x, \mathbf{X})\, \mathbf{K}^{-1} k(\mathbf{X}, y),$$

where $\mathbf{X}$ is the matrix formed by stacking $x_i$, and the kernel function notation is overloaded so that $k(\mathbf{X}, y)$ represents the column vector obtained by stacking $k(x_i, y)$. The posterior over the function also yields a posterior over the expectation of $f$ under $p$ defined in (2). Then, it is straightforward to see $\mathbb{E}[Z(\mathsf{S}_n)] = \mathbf{z}^\top \mathbf{K}^{-1} \mathbf{f}$, where $\mathbf{z}_i = \int k(x, x_i)\, p(x)\, dx$. Note that the weights in (2) can be written as $\mathbf{w} = \mathbf{K}^{-1} \mathbf{z}$. We can write the variance of $Z(\mathsf{S}_n)$ as:

$$\mathrm{var}(Z(\mathsf{S}_n)) = \iint k(x, y)\, p(x)\, p(y)\, dx\, dy - \mathbf{z}^\top \mathbf{K}^{-1} \mathbf{z}. \tag{4}$$
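To make the quadrature weights and the variance (4) concrete, the following sketch evaluates both for a chosen subset of points. It assumes a Gaussian (RBF) kernel and takes $p$ to be the empirical distribution on a finite pool of points, so that the integrals become averages; the function names and the small `jitter` term added for numerical stability are illustrative choices, not part of the original algorithm description.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def bq_weights_and_variance(X_all, idx, gamma=1.0, jitter=1e-10):
    """Weights w = K^{-1} z and the variance (4) for the samples X_all[idx].

    Assumption: p is the empirical distribution on the rows of X_all.
    """
    S = X_all[idx]
    K = rbf_kernel(S, S, gamma) + jitter * np.eye(len(idx))
    z = rbf_kernel(S, X_all, gamma).mean(axis=1)      # z_i = E_p[k(x, x_i)]
    w = np.linalg.solve(K, z)                         # quadrature weights
    prior = rbf_kernel(X_all, X_all, gamma).mean()    # double integral of k under p
    return w, float(prior - z @ w)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
for n in (5, 10, 20):
    w, var = bq_weights_and_variance(X, np.arange(n))
    print(n, var)   # the variance cannot increase as points are added
```

Solving the linear system with `np.linalg.solve` avoids forming $\mathbf{K}^{-1}$ explicitly, which is usually preferable numerically.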

The algorithm Sequential Bayesian Quadrature (SBQ) samples for the points in a greedy fashion with the goal of minimizing the posterior variance of the computed approximate integral:

$$x^{\mathrm{SBQ}}_{n+1} \leftarrow \arg\min_{x \in \mathcal{X}} \mathrm{var}\big(Z(\mathsf{S}_n \cup \{x\})\big). \tag{5}$$

It has been shown that SBQ can also be seen as choosing samples so as to minimize the same discrepancy as KH. Instead of solving a linearized objective as done by KH, it chooses the next sample so as to directly minimize the discrepancy.

In this work, we first analyze Weighted Kernel Herding (WKH) (Algorithm 1), which combines the best of the two algorithms of SBQ and KH by selecting the next sample with KH's linearized rule (3) while updating the weights as in SBQ. The linearized objective was also considered by [2], with the additional constraint that the weights are positive and sum to $1$. They use the connections to the classic Frank-Wolfe algorithm [5] to establish convergence rates under this additional constraint. However, their linear convergence rate depends on how far inside the marginal polytope the optimum lies, and it devolves to a sublinear rate when the optimum can be arbitrarily close to the boundary of the marginal polytope. As noted by [8], the additional condition on the weights is not required for many practical applications. We must note, however, that in some applications it is imperative that the iterates lie within the marginal polytope [13, 12].
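The two alternating steps of WKH can be sketched on a finite pool of candidates as follows. Algorithm 1 is not reproduced in this section, so the loop below is an assumption about its structure: the selection step scores each candidate by the magnitude of its inner product with the current residual of the mean embedding (a matching-pursuit style reading of the linearized objective), and the weight step solves for SBQ-style weights on the selected points. The Gaussian kernel, the empirical choice of $p$, the `jitter` term, and all function names are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def weighted_kernel_herding(X, n_samples, gamma=1.0, jitter=1e-10):
    """Greedy WKH sketch on a finite pool X, with p the empirical distribution on X."""
    K_full = rbf_kernel(X, X, gamma)
    z_all = K_full.mean(axis=1)      # E_p[k(x, x')] for every candidate x
    prior = K_full.mean()            # squared RKHS norm of the mean embedding
    idx, w, history = [], np.zeros(0), []
    for _ in range(n_samples):
        sel = np.array(idx, dtype=int)
        # Selection step (assumed rule): inner product of each candidate with
        # the residual between the mean embedding and the current weighted sum.
        score = np.abs(z_all - K_full[:, sel] @ w)
        score[sel] = -np.inf         # never reselect a point
        idx.append(int(np.argmax(score)))
        sel = np.array(idx, dtype=int)
        # Weight step: SBQ-style weights w = K^{-1} z on the selected points.
        K_sel = K_full[np.ix_(sel, sel)] + jitter * np.eye(len(sel))
        w = np.linalg.solve(K_sel, z_all[sel])
        # Discrepancy at these weights (up to the jitter regularization).
        history.append(float(prior - z_all[sel] @ w))
    return idx, w, history

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2))
idx, w, history = weighted_kernel_herding(X, 10)
print(history)   # non-increasing discrepancy across iterations
```

Replacing the selection step by a direct search over the discrepancy after adding each candidate would give an SBQ-style variant at a higher cost per iteration.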

## 3 Convergence Results

In this section, we establish linear convergence of WKH. The starting point of our analysis is the following re-interpretation of the posterior variance minimization (4) as a variational optimization of Maximum Mean Discrepancy (MMD) [8]. We can re-write (4) as a function of a set of chosen samples $\mathsf{S}$ as:

$$g(\mathsf{S}) := \Big\| \mu_p - \sum_{i \in \mathsf{S}} w_i\, \phi(x_i) \Big\|_k^2, \tag{6}$$

where $\phi(\cdot)$ is the respective feature mapping, i.e., $k(x, y) = \langle \phi(x), \phi(y) \rangle_k$, and where $\mu_p := \mathbb{E}_{x \sim p}[\phi(x)]$ is the mean embedding in the kernel space.
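As a worked step connecting (6) back to (4), the squared norm can be expanded with the reproducing property. The sketch below assumes $\mathbf{z}_i = \langle \mu_p, \phi(x_i) \rangle_k = \int k(x, x_i)\, p(x)\, dx$ and $\mathbf{K}_{ij} = k(x_i, x_j)$ for the points in $\mathsf{S}$, and it assumes that $\mathbf{K}$ is invertible:

```latex
\begin{align*}
g(\mathsf{S})
  &= \|\mu_p\|_k^2
     - 2 \sum_{i \in \mathsf{S}} w_i \langle \mu_p, \phi(x_i) \rangle_k
     + \sum_{i, j \in \mathsf{S}} w_i w_j \langle \phi(x_i), \phi(x_j) \rangle_k \\
  &= \iint k(x, y)\, p(x)\, p(y)\, dx\, dy
     - 2\, \mathbf{w}^\top \mathbf{z}
     + \mathbf{w}^\top \mathbf{K} \mathbf{w}.
\end{align*}
% Setting the gradient with respect to w to zero gives K w = z, so w = K^{-1} z and
\begin{equation*}
\min_{\mathbf{w}} g(\mathsf{S})
  = \iint k(x, y)\, p(x)\, p(y)\, dx\, dy - \mathbf{z}^\top \mathbf{K}^{-1} \mathbf{z}.
\end{equation*}
```

Under these assumptions, the minimum over the weights coincides with the posterior variance in (4), which is why minimizing (6) and minimizing (4) select the same weights.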

Before presenting our result, we delineate the assumptions we make on the cost function (6).

#### Assumption 1 (Realizability):

We assume a set of atoms for , such that .

Assumption 1 posits there exists a set of samples in the domain whose weighted average exactly evaluates to the expectation (2) of under . We emphasize that this assumption is not restrictive. Consider the case for finite dimensional , convex domain of continuous density , and continuous . Then, the expectation will lie within . Hence, just from continuity properties, the average (2) must be obtained somewhere on the domain for some , and so . For non-convex , the expectation might lie outside , but a linear combination of small number of points in may still reproduce this value. For discrete , and/or for infinite dimensional , if Assumption 1 is satisfied, we prove that linear convergence holds. To the best of our knowledge, such a sufficient condition for these harder cases has not been established before, and this is one of our contributions of this work.

#### Assumption 2 (RSC/RSS):

We assume that the eigenvalues of the kernel matrix formed by applying the kernel function on are lower bounded by and upper bounded by .

Assumption 2 is a standard assumption of restricted strong convexity (RSC) and smoothness (RSS) over the cost function (6).
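For a given set of points, the eigenvalue bounds in Assumption 2 can be inspected numerically. The snippet below is only an illustrative check on a toy Gaussian kernel matrix; the helper name `kernel_eig_bounds` and the example points are assumptions, and the check says nothing about the bounds over all subsets that the analysis may require.

```python
import numpy as np

def kernel_eig_bounds(K):
    """Return (smallest, largest) eigenvalue of a symmetric kernel matrix K."""
    eig = np.linalg.eigvalsh((K + K.T) / 2.0)   # symmetrize against round-off
    return float(eig[0]), float(eig[-1])

# Toy example: Gaussian kernel on a few well-separated one-dimensional points.
x = np.array([0.0, 1.0, 2.5, 4.0])
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
lo, hi = kernel_eig_bounds(K)
print(lo, hi, hi / lo)   # lower bound, upper bound, and their ratio
```

A smallest eigenvalue close to zero (for instance, from nearly duplicated points) signals that the strong convexity constant is weak for that set.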

#### Assumption 3 (Standardization):

We assume that the feature mapping is standardized, i.e., , as assumed in previous works [8] for ease of exposition.

###### Theorem 1.

Under Assumptions 1 through 3, if is the sequence of iterates produced by Algorithm 1, the function converges as .

Proof Outline. The central idea of the proof is to track and bound the selection of a sample at each iteration, compared to the ideal selection that could have provided the optimum solution. For this purpose, the properties of the selection subproblem (3) and the assumptions are used. The detailed proof is presented in the appendix.

### 3.1 Relationship with SBQ

The SBQ Algorithm [6, 8] is similar to the WKH algorithm, except that it solves a slightly more involved optimization subproblem to select the next sample point. The algorithm pseudocode is exactly the one presented in Algorithm 1, with step 4 replaced with (5) instead of (3). Thus, WKH solves a linear program every iteration, while SBQ solves a kernel minimization problem. In this and other theoretical works, it is assumed that the oracle that solves the optimization subproblem is given and the convergence rates generally consider the number of calls made to this oracle for desired accuracy of the overall solution. In application, however, the subproblems can be vastly impactful on convergence, run times, and memory management. It has been observed that SBQ converges faster in practice, which is intuitive since it performs more work per iteration. Furthermore, for distributed algorithm variants like the one presented in Section 4, the cost per iteration may not be a bottleneck due to availability of multiple machines. On the other hand, in addition to higher cost per iteration, the application of SBQ might not be always feasible. We present one such example in Section 5, where SBQ requires memorization of the kernel matrix for a reasonable run time.

While the two algorithms have their specific advantages, the decrease in the cost function (6) per iteration of SBQ is at least as large as its decrease per iteration of WKH. This relationship, combined with Theorem 1, establishes linear convergence of SBQ, which to the best of our knowledge, was still an open question (see Table 1 in [8], which lists the convergence rate of SBQ as unknown).

###### Corollary 2.

Under Assumptions 1 through 3, if is the sequence of iterates produced by SBQ, the function converges as .

###### Proof.

The proof follows from the fact that SBQ converges at least as fast as WKH, since SBQ makes faster progress per iteration. To see this, consider a single iteration of each from a given set of iterates . Let and be the corresponding sample selected through an iteration of SBQ and WKH respectively. Then, from (5),

### 3.2 Breaking realizability

While realizability trivially holds with for many nice cases, it is important to discuss cases when realizability breaks. We first consider the case of discrete $p$. Say is the size of the support of discrete $p$, and is the dimensionality of the space. If , from the fact that each iteration only adds a sample that is in the orthogonal complement of previously added samples, . In this case, the same rate will still hold. On the other hand, it is possible that the mean is realizable for no less than samples if . In this case, the linear rate is no longer valid. One can call upon the Frank-Wolfe connection [2], since as long as , we know that a linear rate holds (because the optimum lies in the relative interior away from the boundary, since the effective rate is minimum of the two). However, as discussed before, the optimum may lie arbitrarily close to the boundary. In such cases, the following statement still holds nonetheless. Say , then for iterations of Algorithms 1 or 2 (WKH or SBQ), to select the will have .

Things are trickier for infinite dimensional kernels. It is possible that in infinite dimensions, $\mu_p$ is not realizable for any . We characterize this as a pathological case, since it implies that $\mu_p$ is not in the span of any finite number of atoms . From Theorem 1, one can, however, provide the following weaker guarantee. For any projection of $\mu_p$ onto a finite number of atoms , we need to run at most iterations of Algorithms 1 or 2 with WKH or SBQ, in order to get close to the objective value of the best such -sized projection. The proof of this statement follows from replacing $\mu_p$ with the projection in the analysis of Theorem 1.

## 4 Distributed Kernel Herding

In many cases, the search over the domain can be a bottleneck causing the herding algorithm to be too slow to be useful in practice. In this section, we develop a new herding algorithm that can be distributed over multiple machines and run in a streaming map-reduce fashion. We also provide convergence analysis for the same, using techniques developed in Section 3.

The algorithm proceeds as follows. The domain is split across machines uniformly at random. Each of the machines has access to only , such that and for . Each machine runs its own herding algorithm (Algorithm 1) independently, by specializing (3) to do a restricted search over , instead of . Next, the iterates generated by each machine are sent to a central server machine, which collates the samples from different machines by running another copy of the same algorithm, with replaced by the discrete set of samples sent to the central server by the other machines. Finally, the best solution out of all the solutions is returned. The pseudo-code is illustrated in Algorithm 2. We now provide the convergence guarantees for this algorithm.
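The split, local herding, collation, and best-of selection described above can be sketched as follows. Algorithm 2 is not reproduced in this section, so this is a hedged reading of the description rather than the authors' implementation: the machines are simulated sequentially, the target distribution is assumed to be an empirical sample (`target`) that every machine can evaluate against, and the inner `wkh` routine uses an assumed residual-based selection rule with a Gaussian kernel. All names and the `jitter` term are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def wkh(cand, target, n, gamma=1.0, jitter=1e-10):
    """Pick n rows of `cand` whose weighted sum approximates the mean embedding of `target`."""
    K = rbf_kernel(cand, cand, gamma)
    z = rbf_kernel(cand, target, gamma).mean(axis=1)
    prior = rbf_kernel(target, target, gamma).mean()
    idx, w = [], np.zeros(0)
    for _ in range(min(n, len(cand))):
        sel = np.array(idx, dtype=int)
        score = np.abs(z - K[:, sel] @ w)      # linearized (residual) score
        score[sel] = -np.inf
        idx.append(int(np.argmax(score)))
        sel = np.array(idx, dtype=int)
        w = np.linalg.solve(K[np.ix_(sel, sel)] + jitter * np.eye(len(sel)), z[sel])
    return cand[sel], w, float(prior - z[sel] @ w)

def distributed_wkh(cand, target, n, m, rng, gamma=1.0):
    """Split candidates over m simulated machines, herd locally, collate, return the best."""
    parts = np.array_split(rng.permutation(len(cand)), m)    # uniform random split
    local = [wkh(cand[p], target, n, gamma) for p in parts]  # one run per machine
    pooled = np.vstack([pts for pts, _, _ in local])         # samples sent to the server
    central = wkh(pooled, target, n, gamma)                  # collation run
    return min(local + [central], key=lambda sol: sol[2])    # best of all solutions

rng = np.random.default_rng(0)
target = rng.standard_normal((200, 2))
cand = rng.standard_normal((120, 2))
points, weights, g = distributed_wkh(cand, target, n=8, m=3, rng=rng)
print(points.shape, g)
```

In a real deployment the list comprehension over `parts` would be replaced by parallel workers; the sequential loop here only mirrors the data flow.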

###### Theorem 3.

Under Assumption 1, with , and Assumptions 2 and 3, if is the sequence of iterates produced by Algorithm 2, the function converges as .

Proof Idea: Note that the final set of filtered iterates output is the best out of possible sequences. The proof tracks the possibilities for when is split. The goal is to then show that under all possible scenarios, at least one of the sequences converges linearly. The convergence of individual sequences is based on the proof techniques used in the proof of Theorem 1.

We remark that, for the more general case of , our proof technique does not give a valuable convergence rate. In particular, our analysis yields , which is not very meaningful. This could be an artifact of our proof technique, and improving it is an interesting open question for future research. Also, while we presented the analysis for Distributed WKH, by Corollary 2, the same rate also holds for Distributed SBQ.

## 5 Experiments

We refer the readers to earlier works for empirical evidence of the good performance of the Kernel Herding and SBQ algorithms [2, 3, 20, 18, 19, 8, 6]. In this section, we focus on studying these algorithms and the distributed variant on a real world problem of data summarization using coresets on three different datasets of varying sizes.

The task of data summarization is as follows. Our goal is to select a few data samples that represent the data distribution sufficiently well, so that a model built on the selected subsample of the training data does not degrade too much in performance on the unseen test data. More specifically, we are interested in approximating the test distribution (i.e., discrete $p$) using a few samples from the training set. Hence, algorithms such as SBQ and WKH are applicable, provided we have a reasonable kernel function. We use Fisher kernels [9], since they were recently used to show strong performance for this task [16]. For completeness, we provide a brief overview of the Fisher kernel in the appendix.

Another method that also aims to do training data summarization is that of coreset selection [7], albeit with a different goal of reducing the training data size for optimization speedup while still maintaining guaranteed approximation to the training likelihood. Since the goal itself is optimization speedup, coreset selection algorithms typically employ fast methods, while still trying to capture the data distribution by proxy of the training likelihood. Moreover, the coreset selection algorithm is usually closely tied to the respective model, as opposed to being model-agnostic.

We employ different variants of WKH/SBQ to the problem of training data summarization under logistic regression, as considered by [7] using coreset construction. We experiment using three datasets ChemReact, CovType and WebSpam. ChemReact consists of chemicals each of feature size . Out of these, are test data points. The prediction variable is and signifies if a chemical is reactive. CovType has examples each of feature size . Out of these, are test points. The task is to predict whether a type of tree is present in each location or not. WebSpam has 350,000 webpages each having features. Out of these, are test data points. The task here is to predict whether a webpage is spam or not. We refer to [7] for source of the datasets.

In each of the datasets, we further randomly split the training data into validation and training sets. We train the logistic regression model on the new training data, and we use the validation set as a proxy for the unseen test set distribution. We build the kernel matrix $\mathbf{K}$ and the affinity vector $\mathbf{z}$, and we run different variants of sampling algorithms to choose samples from the training set to approximate the discrete validation set distribution in the Fisher kernel space. Once the training set samples are extracted, we rebuild the logistic regression model only on the selected samples, and we report negative test likelihood on unseen test data to show how well the respective algorithm has built a model-specific dataset summary.

The algorithms we run are WKH, SBQ, $m$-SBQ and $m$-WKH, where $m$ represents the number of machines used to select samples for different values of . The smaller ChemReact data fit on a single machine, so we run WKH and SBQ on a single machine. To present the tradeoff, we also run -WKH and -SBQ. These are about times faster than their single machine counterparts, but they degrade in predictive performance. For the baselines, we use the coreset selection algorithm and random data selection as implemented by [7]. The results are presented in Figure 1. We note that our algorithm yields a significantly better predictive performance compared to random subsets and coresets [7] with the same size of the training subset across different subset sizes. We note that generally SBQ has better performance numbers than WKH for the same across different values of . Note that WebSpam and CovType were too big to run on a single machine, and they are thus perfect examples to illustrate the impact and usefulness of the distributed algorithm. The m-WKH-1 experiment presented in Figure 1 is obtained by averaging the output of a single split of the dataset run on one of the machines (as was done by [16] to scale up, since they do not have the distributed algorithm to work with). All the experiments were run on 12-core 16Gb RAM machines.

An interesting practical consideration between SBQ and WKH is that the inner subproblem (5) is harder than (3), not just in time complexity but also in its memory requirement. This is because solving (5) makes repeated calls to calculate dot products across iterations, and this can rack up the run time quickly unless the kernel matrix is pre-computed and stored in memory. On the other hand, (3) does not have this shortcoming, and the kernel dot product can be calculated as required without significantly impacting the run times.

## 6 Conclusion

We have presented novel analysis for two commonly used algorithms as well as for a new distributed algorithm for estimating means. Our results bridge the gap between theory and empirical evidence by presenting the fastest known convergence rates for these algorithms. Our realizability assumption is the key insight that allows us to improve upon previous results, and at the same time we provide sufficient conditions for convergence in the pathological infinite dimensional case. There are quite a few interesting future research directions. An important one is studying convergence for the distributed algorithm for the case with realizability . Can the realizability help in proposing other faster variants? Alternatively, what are the best possible information theoretic rates for these problems? We hope our work inspires future works in these and other related directions.

## References

• [2] Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning, ICML’12, pages 1355–1362, 2012.
• [3] Yutian Chen, Max Welling, and Alexander J. Smola. Super-samples from kernel herding. In UAI, 2010.
• [4] Josef Dick and Friedrich Pillichshammer. Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, New York, NY, USA, 2010.
• [5] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 1956.
• [6] Zoubin Ghahramani and Carl E. Rasmussen. Bayesian Monte Carlo. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 505–512. MIT Press, 2003.
• [7] Jonathan H. Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4080–4088, 2016.
• [8] Ferenc Huszar and David K. Duvenaud. Optimally-weighted herding is Bayesian quadrature. In UAI, 2012.
• [9] Tommi S. Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 487–493, Cambridge, MA, USA, 1999. MIT Press.
• [10] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, volume 28, pages 427–435, 2013.
• [11] Simon Lacoste-Julien and Martin Jaggi. On the Global Linear Convergence of Frank-Wolfe Optimization Variants. In NIPS 2015, pages 496–504, 2015.
• [12] Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, and Gunnar Raetsch. Boosting black box variational inference. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3401–3411. Curran Associates, Inc., 2018.
• [13] Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, and Gunnar Raetsch. Boosting variational inference: An optimization perspective. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR, 2018.
• [14] Francesco Locatello, Rajiv Khanna, Michael Tschannen, and Martin Jaggi. A unified optimization view on generalized matching pursuit and Frank-Wolfe. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
• [15] A. O’Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245–260, 1991.
• [16] Rajiv Khanna, Been Kim, Joydeep Ghosh, and Oluwasanmi Koyejo. Interpreting black box predictions using Fisher kernels. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR, 2019.
• [17] Rajiv Khanna, Ethan Elenberg, Alexandros G. Dimakis, Joydeep Ghosh, and Sahand Negahban. On approximation guarantees for greedy low rank optimization. In ICML, 2017.
• [18] Max Welling. Herding dynamic weights for partially observed random field models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 599–606. AUAI Press, 2009.
• [19] Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1121–1128, New York, NY, USA, 2009. ACM.
• [20] Max Welling and Yutian Chen. Statistical inference using weak chaos and infinite memory. Journal of Physics: Conference Series, 233:012005, jun 2010.

## Appendix A Appendix

### A.1 Fisher Kernels

Say we have a parametric model that we learn using maximum likelihood estimation, i.e., , where represents the model parameters and represents the data. The notion of similarity that Fisher kernels employ is that if two objects are structurally similar as the model sees them, then slight perturbations in the neighborhood of the fitted parameters would impact the fit of the two objects similarly. In other words, the feature embedding , for an object can be interpreted as a feature mapping which can then be used to define a similarity kernel by a weighted dot product:

 κ(Xi,Xj):=f⊤iI−1fj,

where the matrix is the Fisher information matrix. The information matrix serves to re-scale the dot product, and it is often taken to be the identity, since it loses significance in the limit [9]. The corresponding kernel is then called the practical Fisher kernel, and it is often used in practice.
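As an illustration, the following is a minimal sketch of both kernels for a toy model. It assumes a univariate Gaussian fitted by maximum likelihood and takes the feature embedding to be the Fisher score, i.e., the gradient of the log-likelihood with respect to the parameters at the fitted values. The function names `gaussian_fisher_scores` and `fisher_kernel` are hypothetical helpers, not part of any library.

```python
import numpy as np

def gaussian_fisher_scores(x, mu, var):
    """Fisher scores d/d(mu, var) of log N(x | mu, var), one row per sample."""
    d_mu = (x - mu) / var
    d_var = -0.5 / var + 0.5 * (x - mu) ** 2 / var ** 2
    return np.stack([d_mu, d_var], axis=1)  # shape (n, 2)

def fisher_kernel(scores, info=None):
    """kappa(X_i, X_j) = f_i^T I^{-1} f_j; info=None gives the practical kernel."""
    if info is None:
        return scores @ scores.T
    return scores @ np.linalg.solve(info, scores.T)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=50)   # toy data (assumption)
mu_hat, var_hat = x.mean(), x.var()           # maximum likelihood estimates

scores = gaussian_fisher_scores(x, mu_hat, var_hat)
# Per-sample Fisher information of N(mu, var) in the (mu, var) parameterization.
info = np.diag([1.0 / var_hat, 0.5 / var_hat ** 2])

K_full = fisher_kernel(scores, info)   # weighted dot product with I^{-1}
K_practical = fisher_kernel(scores)    # identity in place of I
```

Both matrices are symmetric and positive semidefinite by construction, and the scores sum to zero because the parameters are the maximum likelihood estimates.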

### A.2 Proof of Theorem 1

For ease of exposition, instead of directly working with , we translate the function to remove any constants not dependent on the variable. We write,

$$
l(S) := \|\mu_p\|_k^2 - g(S) = z^\top K^{-1} z.
$$

Some auxiliary lemmas are proved later in this section. Further, note that Assumption 2, when applied for , ensures that for any iterates considered in this proof we have

$$
-\frac{m_\omega}{2}\|Z(S_1)-Z(S_2)\|_k^2 \;\ge\; l(S_1)-l(S_2)-\langle \nabla l(S_2),\, Z(S_1)-Z(S_2)\rangle_k \;\ge\; -\frac{M_\Omega}{2}\|Z(S_1)-Z(S_2)\|_k^2.
$$

###### Proof.

Say steps of Algorithm 1 have been performed to select the set . Let be the corresponding weight vector. At the step, is sampled as per (3). Let , so that (as per Lemma 4) . Set the weight vector as follows. For , . Set , where is an arbitrary scalar.

From weight optimality proved in Lemma 4,

$$
l(S\cup\{x_i\}) - l(S) \;\ge\; h(S\cup\{x_i\}, u) - l(S),
$$

for an arbitrary . From Assumption 2 (smoothness),

$$
l(S\cup\{x_i\}) - l(S) \;\ge\; \alpha\,\langle \nabla l(S), \phi(x_i)\rangle_k - \alpha^2\,\frac{M_\Omega}{2}.
$$

Let be the optimum value of the solution of (3). Since is the optimizing atom,

$$
l(S\cup\{x_i\}) - l(S) \;\ge\; \alpha\,\gamma_S - \alpha^2\,\frac{M_\Omega}{2}.
$$

Let be the set obtained by orthogonalizing with respect to using the Gram-Schmidt procedure. Putting in , we get,

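$$
\begin{aligned}
l(S\cup\{x_i\}) - l(S) &\ge \frac{1}{2M_\Omega}\,\gamma_S && \text{(7)}\\
&\ge \frac{1}{2rM_\Omega}\sum_{x_j\in S^\star_\perp}\langle \phi(x_j), \nabla l(S)\rangle_k^2 \\
&\ge \frac{m_\omega}{rM_\Omega}\big(l(S\cup S^\star_\perp) - l(S)\big) && \text{(8)}\\
&\ge \frac{m_\omega}{rM_\Omega}\big(l(S^\star_r) - l(S)\big) \\
&= \frac{m_\omega}{rM_\Omega}\big(\|\mu_p\|_k^2 - l(S)\big).
\end{aligned}
$$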

The second inequality is true because is the value of (3) in the iteration. The third inequality follows from Lemma 5. The fourth inequality is true because of monotonicity of , and the final equality is true because of Assumption 1 (realizability).

Let . We have . The result now follows.
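The last step can be unrolled explicitly. The following is a sketch under an assumption: the shorthand $D_i$ and the constant $c$ are notation introduced here for illustration (the definition in the sentence above is not recoverable from the text), with $S_i$ denoting the set selected after $i$ steps.

```latex
% Assumed shorthand (not from the original text):
%   c := m_\omega / (r M_\Omega),   D_i := \|\mu_p\|_k^2 - l(S_i) = g(S_i).
\begin{aligned}
D_i - D_{i+1} &= l(S_{i+1}) - l(S_i) \;\ge\; c\, D_i \\
\Longrightarrow\quad D_{i+1} &\le (1 - c)\, D_i \\
\Longrightarrow\quad D_i &\le (1 - c)^i\, D_0 \;\le\; e^{-c\, i}\, D_0 .
\end{aligned}
```

Since $m_\omega \le M_\Omega$ and $r \ge 1$, we have $c \in (0, 1]$, so the gap contracts geometrically in the number of selected samples, which is the linear rate.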

### A.3 Auxiliary Lemmas

The following lemma shows that the weights in obtained using posterior inference are an optimal choice: they minimize the distance to in the RKHS over any set of weights [16].

###### Lemma 4.

The residual is orthogonal to . In other words, for any set of samples , .

###### Proof.

Recall that , and . For an arbitrary index ,

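$$
\begin{aligned}
\Big\langle \mu_p - \sum_j w_j \phi(x_j),\, \phi(x_i) \Big\rangle_k
&= \int k(x, x_i)\, p(x)\, dx - \Big\langle \sum_j w_j \phi(x_j),\, \phi(x_i) \Big\rangle_k \\
&= z_i - \Big\langle \sum_j w_j \phi(x_j),\, \phi(x_i) \Big\rangle_k \\
&= z_i - \sum_j w_j\, k(x_j, x_i) \\
&= z_i - \sum_t z_t \sum_j K_{ji}\,[K^{-1}]_{tj} \\
&= z_i - z_i,
\end{aligned}
$$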

where the last equality follows by noting that is the inner product of row of and row of , which is if and otherwise. This completes the proof. ∎
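Lemma 4 can be checked numerically. The sketch below assumes a Gaussian kernel and replaces $p$ by the empirical distribution of a finite pool of points, so that $z$ and $\|\mu_p\|_k^2$ become averages of kernel evaluations; the names `gaussian_kernel`, `pool`, and `sq_dist` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) for all pairs of rows."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
pool = rng.normal(size=(400, 2))                   # empirical stand-in for p (assumption)
S = pool[rng.choice(400, size=6, replace=False)]   # an arbitrary set of samples

K = gaussian_kernel(S, S)                   # K_ij = k(x_i, x_j)
z = gaussian_kernel(S, pool).mean(axis=1)   # z_i = E_p[k(x, x_i)]
w = np.linalg.solve(K, z)                   # w = K^{-1} z

# <mu_p - sum_j w_j phi(x_j), phi(x_i)>_k for every selected x_i.
residual_inner = z - K @ w

mu_norm_sq = gaussian_kernel(pool, pool).mean()    # ||mu_p||_k^2 under the empirical p

def sq_dist(weights):
    """||mu_p - sum_j weights_j phi(x_j)||_k^2 for an arbitrary weight vector."""
    return mu_norm_sq - 2.0 * weights @ z + weights @ K @ weights

l_S = z @ w   # l(S) = z^T K^{-1} z
```

Here `residual_inner` vanishes up to floating-point error, and `sq_dist(w)` equals `mu_norm_sq - l_S`; any other weight vector `w + d` increases the squared distance by `d @ K @ d`.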

###### Lemma 5.

For any set of chosen samples , , let be the operator of projection onto span(). Then, .

###### Proof.

Observe that

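$$
\begin{aligned}
0 \le l(S_1\cup S_2) - l(S_1)
&\le \langle \nabla l(S_1),\, Z(S_1\cup S_2) - Z(S_1)\rangle_k - \frac{m_\omega}{2}\|Z(S_1\cup S_2) - Z(S_1)\|_k^2 \\
&\le \max_{X\in \mathrm{span}(S_1\cup S_2)} \langle \nabla l(S_1),\, X - Z(S_1)\rangle_k - \frac{m_\omega}{2}\|X - Z(S_1)\|_k^2 \\
&= \max_{X} \langle P(\nabla l(S_1)),\, X - Z(S_1)\rangle_k - \frac{m_\omega}{2}\|X - Z(S_1)\|_k^2.
\end{aligned}
$$

Solving the maximization problem on the RHS for , we get the required result.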
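The maximization can be solved in closed form. The following is a sketch of that step, assuming the bound stated in Lemma 5 is the one used in (8); the shorthands $G$ and $Y$ are introduced here only for illustration.

```latex
% Assumed shorthand (not from the original text):
%   G := P(\nabla l(S_1)),   Y := X - Z(S_1).
\begin{aligned}
\frac{\partial}{\partial Y}\Big( \langle G, Y\rangle_k - \frac{m_\omega}{2}\|Y\|_k^2 \Big) = G - m_\omega Y = 0
\quad &\Longrightarrow\quad Y^\star = \frac{1}{m_\omega}\, G, \\
\langle G, Y^\star\rangle_k - \frac{m_\omega}{2}\|Y^\star\|_k^2
= \frac{1}{m_\omega}\|G\|_k^2 - \frac{1}{2 m_\omega}\|G\|_k^2
&= \frac{1}{2 m_\omega}\|G\|_k^2, \\
\text{so that}\quad l(S_1\cup S_2) - l(S_1) &\le \frac{1}{2 m_\omega}\,\|P(\nabla l(S_1))\|_k^2 .
\end{aligned}
```

The objective is a concave quadratic in $Y$, so the stationary point is the global maximizer.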

## Appendix B Proof of Theorem 3

We next present some notation and a few lemmas that lead up to the main result of this section (Theorem 3). Let be the -sized solution returned by running Algorithm 1 on . Note that each induces a partition onto the optimal -sized solution as follows ( for this theorem):

$$
T_j := \{x \in S^\star_1 : x \notin \mathrm{WKH}(X_j \cup x, k)\}, \qquad
T^c_j := \{x \in S^\star_1 : x \in \mathrm{WKH}(X_j \cup x, k)\}.
$$

In other words, if the and the machine will not select it as among its iterates, and it is empty otherwise, since is a singleton. We re-use the definition of used in Section A.2.

Before moving to the proof of the main theorem, we prove two prerequisites. Recall is the set of iterates selected by machine .

###### Lemma 6.

For any individual worker machine running local WKH, if is the set of iterates already chosen, then at the selection of next sample point , .

###### Proof.

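We proceed as in the proof of Theorem 1 in Section A.2. From (7), we have

$$
\begin{aligned}
l(S\cup\{x\}) - l(S) &\ge \frac{1}{2M_\Omega}\,\gamma_S \\
&\ge \frac{1}{2M_\Omega}\sum_{x_j\in T_j}\langle \phi(x_j), \nabla l(S)\rangle_k^2 \\
&\ge \frac{m_\omega}{M_\Omega}\big(l(S\cup T_j) - l(S)\big) \\
&\ge \frac{m_\omega}{M_\Omega}\big(l(T_j) - l(S)\big).
\end{aligned}
$$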

###### Lemma 7.

For the aggregator machine that runs WKH over (step 6 of Algorithm 2), we have, at selection of next sample point after having selected , machine such that

$$
\mathbb{E}\big[l(S\cup\{x_i\}) - l(S)\big] \;\ge\; \frac{m_\omega}{M_\Omega}\,\mathbb{E}\big(l(T^c_j) - l(S)\big).
$$

###### Proof.

The expectation is over the random split of into for . We denote to be the complement of . Then, we have that

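$$
\begin{aligned}
\mathbb{E}\big[l(S\cup\{x_i\}) - l(S)\big] &\ge \mathbb{E}\Big[\frac{1}{2M_\Omega}\,\gamma_S\Big] \\
&\ge \frac{1}{2M_\Omega}\sum_{x\in S^\star_1} P\big(x\in \cup_j G_j\big)\, \mathbb{E}\langle \phi(x), \nabla l(S)\rangle_k^2 \\
&= \frac{1}{2sM_\Omega}\sum_{x\in S^\star_1}\Big[\sum_{b=1}^{s} P(x\in T^c_b)\Big]\, \mathbb{E}\langle \phi(x), \nabla l(S)\rangle_k^2 \\
&\ge \frac{m_\omega}{M_\Omega}\,\min_{b\in[s]} \mathbb{E}\big(l(T^c_b) - l(S)\big).
\end{aligned}
$$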

The equality in step 3 above is because of Lemma 8. ∎

We are now ready to prove Theorem 3.

###### Proof of Theorem 3.

If, for a random split of , for any , , then the given rate follows from Lemma 6, after following the straightforward steps covered in the proof of Theorem 1 for proving the rate from the given condition on . On the other hand, if none of , , then . In this case, the given rate follows from Lemma 7. ∎
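The two cases in this proof correspond to the two places a good solution can come from: one of the worker machines, or the aggregator. The following is a rough sketch of such a two-round scheme, not a reproduction of Algorithm 2 (which is not restated in this appendix). It assumes a Gaussian kernel, an empirical pool standing in for $p$, a greedy rule that picks the candidate with the largest absolute inner product with the current residual, and that the best of the local and aggregated solutions is returned. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def greedy_wkh(candidates, pool, k):
    """Greedily pick k >= 1 rows of `candidates`, re-solving the weights each step.

    Returns (indices into candidates, weights, l(S) = z^T K^{-1} z).
    """
    z_all = gaussian_kernel(candidates, pool).mean(axis=1)   # z_c = E_p[k(x, c)]
    K_all = gaussian_kernel(candidates, candidates)
    chosen, w = [], np.zeros(0)
    for _ in range(k):
        idx = np.array(chosen, dtype=int)
        score = np.abs(z_all - K_all[:, idx] @ w)   # |<residual, phi(c)>_k| (assumed rule)
        score[idx] = -np.inf                        # never re-select a point
        chosen.append(int(np.argmax(score)))
        idx = np.array(chosen, dtype=int)
        K = K_all[np.ix_(idx, idx)] + 1e-10 * np.eye(len(idx))  # small jitter for stability
        w = np.linalg.solve(K, z_all[idx])          # optimal weights as in Lemma 4
    return idx, w, float(z_all[idx] @ w)

def distributed_wkh(pool, k, n_machines, rng):
    """Local selection on a random split, then a second pass over the union."""
    parts = np.array_split(rng.permutation(len(pool)), n_machines)
    results = []
    for part in parts:                              # worker machines
        sel, _, value = greedy_wkh(pool[part], pool, k)
        results.append((value, part[sel]))
    union = np.unique(np.concatenate([idx for _, idx in results]))
    sel, _, value = greedy_wkh(pool[union], pool, k)   # aggregator machine
    results.append((value, union[sel]))
    best_value, best_idx = max(results, key=lambda t: t[0])
    return best_idx, best_value

rng = np.random.default_rng(0)
pool = rng.normal(size=(300, 2))                    # toy data (assumption)
best_idx, best_value = distributed_wkh(pool, k=5, n_machines=4, rng=rng)
```

Larger values of $l(S)$ are better here, since $l(S) = \|\mu_p\|_k^2 - g(S)$; returning the maximum over all machines mirrors the case analysis above.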

Finally, here is the statement and proof of an auxiliary lemma that was used above.

###### Lemma 8.

For any .

###### Proof.

We have

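$$
\begin{aligned}
P\big(x\in \cup_j G_j\big) &= \sum_j P\big(x\in A_i \cap x\in \mathrm{WKH}(A_i, k)\big) \\
&= \sum_j P(x\in A_i)\, P\big(x\in \mathrm{WKH}(A_i, k) \mid x\in A_i\big) \\
&= \frac{1}{l}\, P(x\in S_i).
\end{aligned}
$$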
