Linear-time Learning on Distributions with Approximate Kernel Embeddings

# Linear-time Learning on Distributions with Approximate Kernel Embeddings

Dougal J. Sutherland111These two authors contributed equally.    Junier B. Oliva111These two authors contributed equally.    Barnabás Póczos    Jeff Schneider
Carnegie Mellon University
{dsutherl,joliva,bapoczos,schneide}@cs.cmu.edu
###### Abstract

Many interesting machine learning problems are best posed by considering instances that are distributions, or sample sets drawn from distributions. Previous work devoted to machine learning tasks with distributional inputs has done so through pairwise kernel evaluations between pdfs (or sample sets). While such an approach is fine for smaller datasets, the computation of an Gram matrix is prohibitive in large datasets. Recent scalable estimators that work over pdfs have done so only with kernels that use Euclidean metrics, like the distance. However, there are a myriad of other useful metrics available, such as total variation, Hellinger distance, and the Jensen-Shannon divergence. This work develops the first random features for pdfs whose dot product approximates kernels using these non-Euclidean metrics, allowing estimators using such kernels to scale to large datasets by working in a primal space, without computing large Gram matrices. We provide an analysis of the approximation error in using our proposed random features and show empirically the quality of our approximation both in estimating a Gram matrix and in solving learning tasks in real-world and synthetic data.

Linear-time Learning on Distributions with Approximate Kernel Embeddings

Dougal J. Sutherland111These two authors contributed equally.and Junier B. Oliva111These two authors contributed equally.and Barnabás Póczos and Jeff Schneider Carnegie Mellon University {dsutherl,joliva,bapoczos,schneide}@cs.cmu.edu

## Introduction

As machine learning matures, focus has shifted towards datasets with richer, more complex instances. For example, a great deal of effort has been devoted to learning functions on vectors of a large fixed dimension. While complex static vector instances are useful in a myriad of applications, many machine learning problems are more naturally posed by considering instances that are distributions, or sets drawn from distributions. Political scientists can learn a function from community demographics to vote percentages to understand who supports a candidate (?). The mass of dark matter halos can be inferred from the velocity of galaxies in a cluster (?). Expensive expectation propagation messages can be sped up by learning a “just-in-time” regression model (?). All of these applications are aided by working directly over sets drawn from the distribution of interest, rather than having to develop a per-problem ad-hoc set of summary statistics.

Distributions are inherently infinite-dimensional objects, since in general they require an infinite number of parameters for their exact representation. Hence, it is not immediate how to extend traditional finite vector technique machine learning techniques to distributional instances. However, recent work has provided various approaches for dealing with distributional data in a nonparametric fashion. For example, regression from distributional covariates to real or distributional responses is possible via kernel smoothing (??), and many learning tasks can be solved with rkhs approaches (??). A major shortcoming of both approaches is that they require computing kernel evaluations per prediction, where is the number of training instances in a dataset. Often, this implies that one must compute a Gram matrix of pairwise kernel evaluations. Such approaches fail to scale to datasets where the number of instances is very large. Another shortcoming of these approaches is that they are often based on Euclidean metrics, either working over a linear kernel, or one based on the distance over distributions. While such kernels are useful in certain applications, better performance can sometimes be obtained by considering non-Euclidean based kernels. To this end, ? (?) use a kernel based on Rényi divergences; however, this kernel is not positive semi-definite (psd), leading to even higher computational cost and other practical issues.

This work addresses these major shortcomings by developing an embedding of random features for distributions. The dot product of the random features for two distributions will approximate kernels based on various distances between densities (see Figure 1). With this technique, we can approximate kernels based on total variation, Hellinger, and Jensen-Shannon divergences, among others. Since there is then no need to compute a Gram matrix, one will be able to use these kernels while still scaling to datasets with a large number of instances using primal-space techniques. We provide an approximation bound for the embeddings, and demonstrate the efficacy of the embeddings on both real-world and synthetic data. To the best of our knowledge, this work provides the first non-discretized embedding for non- kernels for probability density functions.

## Related Work

The two main lines of relevant research are the development of kernels on probability distributions and explicit approximate embeddings for scalable kernel learning.

#### Learning on distributions

In computer vision, the popular “bag of words” model (?) represents a distribution by quantizing it onto codewords (usually by running -means on all, or many, of the input points from all sets), then compares those histograms with some kernel (often exponentiated ).

Another approach estimates a distance between distributions, often the distance or Kullback-Leibler (kl) divergence, parametrically (???) or nonparametrically (??). The distance can then be used in kernel smoothing (??) or Mercer kernels (????).

These approaches can be powerful, but usually require computing an matrix of kernel evaluations, which can be infeasible for large datasets. Using these distances in Mercer kernels faces an additional challenge, which is that the estimated Gram matrix may not be psd, due to estimation error or because some divergences in fact do not induce a psd kernel. In general this must be remedied by altering the Gram matrix a “nearby” psd one. Typical approaches involve eigendecomposing the Gram matrix, which usually costs computation and also presents challenges for traditional inductive learning, where the test points are not known at training time (?).

One way to alleviate the scaling problem is the Nyström extension (?), in which some columns of the Gram matrix are used to estimate the remainder. In practice, one frequently must compute many columns, and methods to make the result psd are known only for mildly-indefinite kernels (?).

Another approach is to represent a distribution by its mean rkhs embedding under some kernel . The rkhs inner product is known as the mean map kernel (mmk), and the distance the maximum mean discrepancy (mmd) (???). When is the common rbf kernel, the mmk estimate is proportional to an inner product between Gaussian kernel density estimates.

#### Approximate embeddings

Recent interest in approximate kernel embeddings was spurred by the “random kitchen sink” (rks) embedding (?), which approximates shift-invariant kernels on by sampling their Fourier transform.

A related line of work considers additive kernels, of the form , usually defined on (e.g. histograms). ? (?) construct an embedding for the intersection kernel via step functions. ? (?) consider any homogeneous , so that , which also allows them to embed histogram kernels such as the additive kernel and Jensen-Shannon divergence. Their embedding uses the same fundamental result of ? (?) as ours; we expand to the continuous rather than the discrete case. ? (?) later apply rks embeddings to obtain generalized rbf kernels Equation 1.

For embeddings of kernels on input spaces other than , the rks embedding extends naturally to locally compact abelian groups (?). ? (?) embedded an estimate of the distance between continuous densities via orthonormal basis functions. An embedding for the base kernel also gives a simple embedding for the mean map kernel (????).

## Embedding Information Theoretic Kernels

For a broad class of distributional distances , including many common and useful information theoretic divergences, we consider generalized rbf kernels of the form

 K(p,q)=exp(−12σ2d2(p,q)). (1)

We will construct features such that as follows:

1. We define a random function such that , where is a function from to . Thus the metric space of densities with distance is approximately embedded into the metric space of -dimensional functions.

2. We use orthonormal basis functions to approximately embed smooth functions into finite vectors in . Combined with the previous step, we obtain features such that is approximated by Euclidean distances between the features.

3. We use the rks embedding so that inner products between features, in , approximate .

We can thus approximate the powerful kernel without needing to compute an expensive Gram matrix.

### Homogeneous Density Distances (hdds)

We consider kernels based on metrics which we term homogeneous density distances (hdds):

 d2(p,q)=∫[0,1]ℓκ(p(x),q(x))dx, (2)

where is a negative-type kernel, i.e. a squared Hilbertian metric, and for all . Table 1 shows a few important instances. Note that we assume the support of the distributions is contained within .

We then use these distances in a generalized rbf kernel Equation 1. is a Hilbertian metric (?), so is positive definite (?). Note we use the metric, even though is itself a metric.

### Embedding hdds into L2

? (?) shows that corresponds to a bounded measure , as in Table 1, with

 κ(x,y)=∫R≥0|x12+\mathbbmiλ−y12+\mathbbmiλ|2dμ(λ). (3)

Let and ; then

 κ(x,y)=Eλ∼μZ|gλ(x)−gλ(y)|2

where .

We can approximate the expectation with an empirical mean. Let for ; then

 κ(x,y)≈1MM∑j=1|gλj(x)−gλj(y)|2.

Hence, using to denote the real and imaginary parts:

 d2(p,q) =∫[0,1]ℓκ(p(x),q(x))dx =∫[0,1]ℓEλ∼μZ|gλ(p(x))−gλ(q(x))|2dx ≈1MM∑j=1∫[0,1]ℓ((R(gλj(p(x)))−R(gλj(q(x))))2 +(I(gλj(p(x)))−I(gλj(q(x))))2)dx =∥ψ(p)−ψ(q)∥2, (4)

where is given by

 1√M(pRλ1(x),…,pRλM(x),pIλ1(x),…,pIλM(x)),

defining , . Hence, the hdd between densities and is approximately the distance from to , where maps a function to a vector-valued function of functions. can typically be quite small, since the kernel it approximates is one-dimensional.

### Finite Embeddings of L2

If densities and are smooth, then the metric between the and functions may be well approximated using projections to basis functions. Suppose that is an orthonormal basis for ; then we can construct an orthonormal basis for by the tensor product:

 {φα}α∈Zℓwhereφα(x)=ℓ∏i=1φαi(xi), x∈[0,1]ℓ, ∀f∈L2([0,1]ℓ), f(x)=∑α∈Zℓaα(f)φα(x)

and . Let be an appropriately chosen finite set of indices. If are smooth and , then . Thus we can approximate as the squared distance between finite vectors:

 d2(p,q) ≈∥ψ(p)−ψ(q)∥2 ≈1MM∑j=1∥→a(pRλj)−→a(qRλj)∥2+∥→a(pIλj)−→a(qIλj)∥2 =∥A(p)−A(q)∥2 (5)

where concatenates the features for each function. That is, is given by

 1√M(→a(pRλ1),…,→a(pRλM),→a(pIλ1),…,→a(pIλM)). (6)

We will discuss how to estimate , shortly.

### Embedding rbf Kernels into RD

The features approximate the hdd (2) in ; thus applying the rks embedding (?) to the features will approximate our generalized rbf kernel (1). The rks embedding is111There are two versions of the embedding in common use, but this one is preferred (?). such that for fixed and for each :

 z(x)Tz(y)≈exp(−12σ2∥x−y∥2), where z(x)=√2D(sin(ωT1x),cos(ωT1x),…). (7)

Thus we can approximate the hdd kernel (1) as:

 K(p,q) =exp(−12σ2d2(p,q)) ≈exp(−12σ2∥A(p)−A(q)∥2) ≈z(A(p))Tz(A(q)). (8)

### Finite Sample Estimates

Our final approximation for hdd kernels (8) depends on integrals of densities and . In practice, we are unlikely to directly observe an input density, but even given a pdf , the integrals that make up the elements of are not readily computable. We thus first estimate the density as , e.g. with kernel density estimation (kde), and estimate as . Recall that the elements of are:

 aα(^pSλj)=∫[0,1]ℓφα(t)^pSλj(t)dt (9)

where . In lower dimensions, we can approximate (9) with simple Monte Carlo numerical integration. Choosing :

 ^aα(^pSλj)=1nene∑i=1φα(ui)^pSλj(ui), (10)

obtaining . We note that in high dimensions, one may use any high-dimensional density estimation scheme (e.g. ? ?) and estimate (9) with mcmc techniques (e.g. ? ?).

### Summary and Complexity

The algorithm for computing features for a set of distributions , given sample sets where , is thus:

1. Draw scalars and vectors , in time.

2. For each of the input distributions :

1. Compute a kernel density estimate from , for each in (10), in time.

2. Compute using a numerical integration estimate as in (10), in time.

3. Get the rks features, , in time.

Supposing each , this process takes a total of time. Taking to be asymptotically , , and for simplicity, this is time, compared to for the methods of ? (?) and for ? (?).

## Theory

We bound for two fixed densities and by considering each source of error: kernel density estimation ; approximating with samples (); truncating the tails of the projection coefficients (); Monte Carlo integration (); and the rks embedding ().

We need some smoothness assumptions on and : that they are members of a periodic Hölder class , that they are bounded below by and above by , and that their kernel density estimates are in with probability at least . We use a suitable form of kernel density estimation, to obtain a uniform error bound with a rate based on the function (?). We use the Fourier basis and choose for parameters , .

Then, for any , the probability of the error exceeding is at most:

 2exp(−Dε2RKS)+2exp(−Mε4λ/(8Z2))+δ+2C−1⎛⎝ε4KDEn2β/(2β+ℓ)4logn⎞⎠+2M(1−μ([0,utail)))+8M|V|exp⎛⎜ ⎜ ⎜⎝−12ne⎛⎜ ⎜⎝√1+ε2int/(8|V|Z)−1√ρ∗+1⎞⎟ ⎟⎠2⎞⎟ ⎟ ⎟⎠

where .

The bound decreases when the function is smoother (larger , ; smaller ) or lower-dimensional (), or when we observe more samples (). Using more projection coefficients (higher or smaller , giving higher ) improves the approximation but makes numerical integration more difficult. Likewise, taking more samples from (higher ) improves that approximation, but increases the number of functions to be approximated and numerically integrated.

For the proof and further details, see the appendix.

## Numerical Experiments

Throughout these experiments we use , (selected as rules of thumb; larger values did not improve performance), and use a validation set (10% of the training set) to choose bandwidths for kde and the rbf kernel as well as model regularization parameters. Except in the scene classification experiments, the histogram methods used 10 bins per dimension; performance with other values was not better. The kl estimator used the fourth nearest neighbor.

We evaluate rbf kernels based on various distances. First, we try our JS, Hellinger, and TV embeddings. We compare to kernels as in ? (?): (L2). We also try the mmd distance (?) with approximate kernel embeddings: , where is the mean embedding (MMD). We further compare to rks with histogram js embeddings (?) (Hist JS); we also tried embeddings, but their performance was quite similar. We finally try the full Gram matrix approach of ? (?) with the kl estimator of ? (?) in an rbf kernel (KL), as did ? (?).

### Gram Matrix Estimation

We first illustrate that our embedding, using the parameter selections as above, can approximate the Jensen-Shanon kernel well. We compare three different approaches to estimating . Each approach uses kernel density estimates . The estimates are compared on a dataset of random gmm distributions and samples of size : . See the appendix for more details.

The first approach approximates js based on empirical estimates of entropies . The second approach estimates js as the Euclidean distance of vectors of projection coefficients (5) : . For these first two approaches we compute the pairwise kernel evaluations in the Gram matix as , and using their respective approximations for js. Lastly, we directly estimate the JS kernel with dot products of our random features (8): , with .

Figure 2 shows the true pairwise kernel values versus the aforementioned estimates. Quantitatively, the entropy method obtained a squared correlation to the true kernel value of ; using the features with an exact kernel yielded ; adding rks embeddings gave . Thus our method’s estimates are nearly as good as direct estimation via entropies, while allowing us to work in primal space and avoid Gram matrices.

### Estimating the Number of Mixture Components

We will now illustrate the efficacy of hdd random features in a regression task, following ? (?): estimate the number of components from a mixture of truncated Gaussians. We generate the distributions as follows: Draw the number of components for the th distribution as . For each component select a mean and covariance , where , , and is a diagonal matrix with . Then weight each component equally in the mixture. Given a sample , we predict the number of components . An example distribution and sample are shown in Figure 3; predicting the number of components is difficult even for humans.

Figure 4 presents results for predicting with ridge regression the number of mixture components , given a varying number of sample sets , with ; we use . The hdd-based kernels achieve substantially lower error than the and mmd kernels. They also outperform the histogram kernel, especially with , and the kl kernel. Note that fitting mixtures with em and selecting a number of components using aic (?) or bic (?) performed much worse than regression; only aic with outperformed a constant predictor of . Linear versions of the and mmd kernels were also no better than the constant predictor.

The hdd embeddings were more computationally expensive than the other embeddings, but much less expensive than the kl kernel, which grows at least quadratically in the number of distributions. Note that the histogram embeddings used an optimized C implementation (?), as did the kl kernel, while the hdd embeddings used a simple Matlab implementation.

### Image Classification

As another example of the performance of our embeddings, we now attempt to classify images based on their distributions of pixel values. We took the “cat” and “dog” classes from the cifar-10 dataset (?), and represented each image by a set of triples , where and are the position of each pixel in the image and the pixel value after converting to grayscale. The horizontal reflection of the image was also included, so each sample set had . This is certainly not the best representation for these images; rather, we wish to show that given this simple representation, our hdd kernels perform well relative to the other options.

We used the same kernels as above in an svm classifier from liblinear (? ?, for the embeddings) or libsvm (? ?, for the kl kernel), with . Figure 6 shows computation time and accuracy on the standard test set (of size 2k) with 2.5k, 5k, and 10k training images. Our js and Hellinger embedding approximately match the histogram js embedding in accuracy here, while our tv embedding beats histogram js; all outperform and mmd. We could only run the kl kernel for the 2.5k training set size; its accuracy was comparable to the hdd and histogram embeddings, at far higher computational cost.

### Scene Classification

Modern computer vision classification systems typically consist of a deep network with several convolutional and pooling layers to extract complex features of input images, followed by one or two fully-connected classification layers. The activations are of shape , where is the number of filters; each unit corresponds to an overlapping patch of the original image. We can thus treat the final pooled activations as a sample of size from an -dimensional distribution, similarly to how ? (?) and ? (?) used sift features from image patches. ? (?) set accuracy records on several scene classification datasets with a particular ad-hoc method of extracting features from distributions (D3); we compare to our more principled alternatives.

We consider the Scene-15 dataset (?), which contains natural images in 15 location categories, and follow ? in extracting features from the last convolutional layer of the imagenet-vgg-verydeep-16 model (?). We replace that layer’s rectified linear activations with sigmoid squashing to .333 We used piecewise-linear weights before the sigmoid function such that maps to , the 90th percentile of the positive observations maps to , and the 10th percentile of the negative observations to , for each filter. ranges from to . There are 512 filter dimensions; we concatenate features extracted from each independently.

We train on the standard for this dataset of 100 images from each class (1500 total) and test on the remainder; Figure 6 shows results. We do not add any spatial information to the model; still, we match the best prior published performance of , which trained on over 7 million external images (?). Adding spatial information brought the D3 method slightly above 92% accuracy; their best hybrid method obtained 92.9%. Using these features, however, our methods match or beat mmd and substantially outperform D3, , and the histogram embeddings.

## Discussion

This work presents the first nonlinear embedding of density functions for quickly computing hdd-based kernels, including kernels based on the popular total variation, Hellinger and Jensen-Shanon divergences. While such divergences have shown good empirical results in the comparison of densities, nonparametric uses of kernels with these divergences previously necessitated the computation of a large Gram matrix, prohibiting their use in large datasets. Our embeddings allow one to work in a primal space while using information theoretic kernels. We analyze the approximation error of our embeddings, and illustrate their quality on several synthetic and real-world datasets.

## References

• [Akiake 1973] Akiake, H. 1973. Information theory and an extension of the maximum likelihood principle. In 2nd Int. Symp. on Inf. Theory.
• [Belongie et al. 2002] Belongie, S.; Fowlkes, C.; Chung, F.; and Malik, J. 2002. Spectral partitioning with indefinite kernels using the Nyström extension. In ECCV.
• [Chang and Lin 2011] Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3):1–27.
• [Chen et al. 2009] Chen, Y.; Garcia, E. K.; Gupta, M. R.; Rahimi, A.; and Cazzanti, L. 2009. Similarity-based classification: Concepts and algorithms. JMLR 10:747–776.
• [Fan et al. 2008] Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. 2008. LIBLINEAR: A library for large linear classification. JMLR 9:1871–1874.
• [Flaxman, Wang, and Smola 2015] Flaxman, S. R.; Wang, Y.-x.; and Smola, A. J. 2015. Who supported Obama in 2012? Ecological inference through distribution regression. In KDD, 289–298.
• [Fuglede 2005] Fuglede, B. 2005. Spirals in Hilbert space: With an application in information theory. Exposition. Math. 23(1):23–45.
• [Giné and Guillou 2002] Giné, E., and Guillou, A. 2002. Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist. 38(6):907–921.
• [Gretton et al. 2009] Gretton, A.; Fukumizu, K.; Harchaoui, Z.; and Sriperumbudur, B. K. 2009. A fast, consistent kernel two-sample test. In NIPS.
• [Haasdonk and Bahlmann 2004] Haasdonk, B., and Bahlmann, C. 2004. Learning with distance substitution kernels. In Pattern Recognition: 26th DAGM Symposium, 220–227.
• [Hoffman and Gelman 2014] Hoffman, M. D., and Gelman, A. 2014. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. JMLR 15(1):1593–1623.
• [Jaakkola and Haussler 1998] Jaakkola, T., and Haussler, D. 1998. Exploiting generative models in discriminative classifiers. In NIPS.
• [Jebara, Kondor, and Howard 2004] Jebara, T.; Kondor, R.; and Howard, A. 2004. Probability product kernels. JMLR 5:819–844.
• [Jitkrittum et al. 2015] Jitkrittum, W.; Gretton, A.; Heess, N.; Eslami, S.; Lakshminarayanan, B.; Sejdinovic, D.; and Szabó, Z. 2015. Kernel-based just-in-time learning for passing expectation propagation messages. UAI.
• [Kondor and Jebara 2003] Kondor, R., and Jebara, T. 2003. A kernel between sets of vectors. In ICML.
• [Krishnamurthy et al. 2014] Krishnamurthy, A.; Kandasamy, K.; Poczos, B.; and Wasserman, L. 2014. Nonparametric estimation of Rényi divergence and friends. In ICML.
• [Krizhevsky and Hinton 2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. University of Toronto, Tech. Rep.
• [Lafferty, Liu, and Wasserman 2012] Lafferty, J.; Liu, H.; and Wasserman, L. 2012. Sparse nonparametric graphical models. Statistical Science 27(4):519–537.
• [Lazebnik, Schmid, and Ponce 2006] Lazebnik, S.; Schmid, C.; and Ponce, J. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
• [Leung and Malik 2001] Leung, T., and Malik, J. 2001. Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV 43.
• [Li, Ionescu, and Sminchisescu 2010] Li, F.; Ionescu, C.; and Sminchisescu, C. 2010. Random Fourier approximations for skewed multiplicative histogram kernels. In Pattern Recognition: DAGM, 262–271.
• [Lin 1991] Lin, J. 1991. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37(1):145–151.
• [Lopez-Paz et al. 2015] Lopez-Paz, D.; Muandet, K.; Schölkopf, B.; and Tolstikhin, I. 2015. Towards a learning theory of causation. ICML.
• [Maji and Berg 2009] Maji, S., and Berg, A. C. 2009. Max-margin additive classifiers for detection. In ICCV.
• [Moreno, Ho, and Vasconcelos 2003] Moreno, P. J.; Ho, P. P.; and Vasconcelos, N. 2003. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In NIPS.
• [Muandet et al. 2012] Muandet, K.; Fukumizu, K.; Dinuzzo, F.; and Schölkopf, B. 2012. Learning from distributions via support measure machines. In NIPS.
• [Ntampaka et al. 2015] Ntampaka, M.; Trac, H.; Sutherland, D. J.; Battaglia, N.; Póczos, B.; and Schneider, J. 2015. A machine learning approach for dynamical mass measurements of galaxy clusters. The Astrophysical Journal 803(2):50.
• [Oliva et al. 2014] Oliva, J. B.; Neiswanger, W.; Póczos, B.; Schneider, J.; and Xing, E. 2014. Fast distribution to real regression. In AISTATS.
• [Oliva, Póczos, and Schneider 2013] Oliva, J. B.; Póczos, B.; and Schneider, J. 2013. Distribution to distribution regression. In ICML.
• [Póczos et al. 2012a] Póczos, B.; Rinaldo, A.; Singh, A.; and Wasserman, L. 2012a. Distribution-free distribution regression. AISTATS.
• [Póczos et al. 2012b] Póczos, B.; Xiong, L.; Sutherland, D. J.; and Schneider, J. 2012b. Nonparametric kernel estimators for image classification. In CVPR.
• [Rahimi and Recht 2007] Rahimi, A., and Recht, B. 2007. Random features for large-scale kernel machines. In NIPS.
• [Schwarz 1978] Schwarz, G. 1978. Estimating the dimension of a model. Ann. Statist. 6(2):461–464.
• [Simonyan and Zisserman 2015] Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
• [Sricharan, Wei, and Hero 2013] Sricharan, K.; Wei, D.; and Hero, III, A. O. 2013. Ensemble estimators for multivariate entropy estimation. IEEE Trans. Inf. Theory 59:4374–4388.
• [Sutherland and Schneider 2015] Sutherland, D. J., and Schneider, J. 2015. On the error of random Fourier features. In UAI.
• [Szabó et al. 2015] Szabó, Z.; Gretton, A.; Póczos, B.; and Sriperumbudur, B. 2015. Two-stage sampled learning theory on distributions. AISTATS.
• [Vedaldi and Fulkerson 2008] Vedaldi, A., and Fulkerson, B. 2008. VLFeat: An open and portable library of computer vision algorithms.
• [Vedaldi and Zisserman 2010] Vedaldi, A., and Zisserman, A. 2010. Efficient additive kernels via explicit feature maps. In CVPR.
• [Vempati et al. 2010] Vempati, S.; Vedaldi, A.; Zisserman, A.; and Jawahar, C. V. 2010. Generalized RBF feature maps for efficient detection. In British Machine Vision Conference.
• [Wang, Kulkarni, and Verdú 2009] Wang, Q.; Kulkarni, S. R.; and Verdú, S. 2009. Divergence estimation for multidimensional densities via -nearest-neighbor distances. IEEE Trans. Inf. Theory 55(5):2392–2405.
• [Williams and Seeger 2001] Williams, C. K. I., and Seeger, M. 2001. Using the Nyström method to speed up kernel machines. In NIPS.
• [Wu, Gao, and Liu 2015] Wu, J.; Gao, B.-B.; and Liu, G. 2015. Visual recognition using directional distribution distance.
• [Zhou et al. 2014] Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; and Oliva, A. 2014. Learning deep features for scene recognition using Places database. In NIPS.

## Acknowledgements

This work was funded in part by NSF grant IIS1247658 and by DARPA grant FA87501220324. DJS is also supported by a Sandia Campus Executive Program fellowship.

## Appendix A Appendices

### Gram Matrix Estimation

We illustrate our embedding’s ability to approximate the Jensen-Shanon divergence. In the examples below the densities considered are mixtures of five equally weighted truncated spherical Gaussians on . That is,

 pi(x)=155∑j=1Nt(mij,diag(s2ij))

where , and is the distribution of a Gaussian truncated on with mean parameter and covariance matrix parameter . We work over the sample set , where , , .

We compare three different approaches to estimating . Each approach uses density estimates , which are computed using kernel density estimation. The first approach is based on estimating js using empirical estimates of entropies:

 JS(pi,pj) =−12Epi[log(1pi(x))]−12Epj[log(1pj(x))] +E12pi+12pj[log(2pi(x)+pj(x))] +12⌈n/2⌉∑m=1log⎛⎝2^pi(X(i)m)+^pj(X(i)m)⎞⎠ +12⌈n/2⌉∑m=1log⎛⎝2^pi(X(j)m)+^pj(X(k)m)⎞⎠ =JSent(pi,pj),

where density estimates above are based on points to avoid biasing the empirical means. The second approach estimates js as the Euclidean distance of vectors of projection coefficients:

 JS(pi,pj) ≈∥^A(^pi)−^A(^pj)∥2=JSpc(pi,pj),

where here the density estimates are based on the entire set of points . We build a Gram matrix for each of these approaches by setting and . Lastly, we directly estimate the js kernel with random features:

 Grksij=z(^A(^pi))Tz(^A(^pj)).

We compare the effectiveness of each approach by computing the score of the estimates produced versus a true js kernel value computed through numerically integrating the true densities (see Tables 2 and 7). The rbf values estimated with our random features produce estimates that are nearly as good as directly estimating JS divergences through entropies, whilst allowing us to work over a primal space and thus avoid computing a Gram matrix for learning tasks.

### Proofs

We will now prove the bound on the error probability of our embedding for fixed densities and .

#### Setup

We will need a few assumptions on the densities:

1. and are bounded above and below: for , .

2. for some . refers to the Hölder class of functions whose partial derivatives up to order are continuous and whose th partial derivatives, where is a multi-index of order , satisfy . Here is the greatest integer strictly less than .

3. , are periodic.

These are fairly standard smoothness assumptions in the nonparametric estimation literature.

Let . If , then for some ; otherwise, clearly . Then, from assumption 3, , the periodic Hölder class. We’ll need this to establish the Sobolev ellipsoid containing and .

We will use kernel density estimation with a bounded, continuous kernel so that the bound of ? (?) applies, with bandwidth , and truncating density estimates to .

We also use the Fourier basis , and define as the set of indices s.t. for parameters , to be discussed later.

#### Decomposition

Let . Then

 ∣∣K(p,q)−z(^A(^p))Tz(^A(^q))∣∣≤∣∣K(p,q)−rσk(∥^A(^p)−^A(^q)∥)∣∣+∣∣rσk(∥^A(^p)−^A(^q)∥)−z(^A(^p))Tz(^A(^q))∣∣.

The latter term was bounded by ? (?). For the former, note that is -Lipschitz, so the first term is at most . Breaking this up with the triangle inequality:

 ∣∣d(p,q)−∥^A(^p)−^A(^q)∥∣∣≤|d(p,q)−d(^p,^q)|+|d(^p,^q)−∥ψ(^p)−ψ(^q)∥|+|∥ψ(^p)−ψ(^q)∥−∥A(^p)−A(^q)∥|+∣∣∥A(^p)−A(^q)∥−∥^A(^p)−^A(^q)∥∣∣. (11)

#### Estimation error

Recall that is a metric, so the reverse triangle inequality allows us to address the first term with

 |d(p,q)−d(^p,^q)|≤d(p,^p)+d(q,^q).

For the total variation, squared Hellinger, or Jensen-Shannon hdds, we have that (?). Moreover, as the distributions are supported on , .

It is a consequence of ? (?) that, for any , for some depending on the kernel. Thus , where .

#### λ approximation

The second term of (11), the approximation error due to sampling s, admits a simple Hoeffding bound. Note that , viewed as a random variable in only, has expectation and is bounded by (where ): write it as , expand the square, and use (via Cauchy-Schwarz).

For nonnegative random variables and , , so we have that is at most .

#### Tail truncation error

The third term of (11), the error due to truncating the tail projection coefficients of the functions, requires a little more machinery. First note that is at most

 M∑j=1∑S=R,I∑α∉V∣∣aα(^pS