Comparing Distributions and Shapes using the Kernel Distance

Starting with a similarity function between objects, it is possible to define a distance metric on pairs of objects, and more generally on probability distributions over them. These distance metrics have a deep basis in functional analysis, measure theory and geometric measure theory, and have a rich structure that includes an isometric embedding into a (possibly infinite dimensional) Hilbert space. They have recently been applied to numerous problems in machine learning and shape analysis.

In this paper, we provide the first algorithmic analysis of these distance metrics. Our main contributions are as follows: (i) We present fast approximation algorithms for computing the kernel distance between two point sets and that runs in near-linear time in the size of (note that an explicit calculation would take quadratic time). (ii) We present polynomial-time algorithms for approximately minimizing the kernel distance under rigid transformation; they run in time . (iii) We provide several general techniques for reducing complex objects to convenient sparse representations (specifically to point sets or sets of points sets) which approximately preserve the kernel distance. In particular, this allows us to reduce problems of computing the kernel distance between various types of objects such as curves, surfaces, and distributions to computing the kernel distance between point sets. These take advantage of the reproducing kernel Hilbert space and a new relation linking binary range spaces to continuous range spaces with bounded fat-shattering dimension.

## 1 Introduction

Let be a kernel function, such as a Gaussian kernel; describes how similar two points are. For point sets we can define a similarity function . Then the kernel distance is defined as

 DK(P,Q)=√κ(P,P)+κ(Q,Q)−2κ(P,Q). (1.1)

By altering the kernel , and the weighting of elements in , the kernel distance can capture distance between distributions, curves, surfaces, and even more general objects.

#### Motivation.

The earthmover distance (EMD) takes a metric space and two probability distributions over the space, and computes the amount of work needed to "transport" mass from one distribution to another. It has become a metric of choice in computer vision, where images are represented as intensity distributions over a Euclidean grid of pixels. It has also been applied to shape comparison [8], where shapes are represented by point clouds (discrete distributions) in space.

While the EMD is a popular way of comparing distributions over a metric space, it is also an expensive one. Computing the EMD requires solving an optimal transportation problem via the Hungarian algorithm, and while approximations exist for restricted cases like the Euclidean plane, they are still expensive, requiring either quadratic time in the size of the input or achieving at best a constant factor approximation when taking linear time [38, 2]. Further, it is hard to index structures using the EMD for performing near-neighbor, clustering and other data analysis operations. Indeed, there are lower bounds on our ability to embed the EMD into well-behaved normed spaces [4].

The kernel distance has thus become an effective alternative to comparing distributions on a metric space. In machine learning, the kernel distance has been used to build metrics on distributions [42, 23, 41, 39, 32] and to learn hidden Markov models [40]. In the realm of shape analysis [45, 18, 19, 17], the kernel distance (referred to there as the current distance) has been used to compare shapes, whether they be point sets, curves, or surfaces.

All of these methods utilize key structural features of the kernel distance. When constructed using a positive definite111A positive definite function generalizes the idea of a positive definite matrix; see Section 2. similarity function , the kernel distance can be interpreted through a lifting map to a reproducing kernel Hilbert space (RKHS), . This lifting map is isometric; the kernel distance is precisely the distance induced by the Hilbert space (). Furthermore, a point set has an isometric lifted representation as a single vector in so . Moreover, by choosing an appropriately scaled basis, this becomes a simple distance, so all algorithmic tools developed for comparing points and point sets under can now be applied to distributions and shapes.

Dealing with uncertain data provides another reason to study the kernel distance. Rather than thinking of as a similarity function, we can think of it as a way of capturing spatial uncertainty; is the likelihood that the object claimed to be at is actually at . For example, setting gives us a Gaussian blur function. In such settings, the kernel distance computes the symmetric difference between shapes with uncertainty described by .

#### Our work.

We present the first algorithmic analysis of the kernel distance. Our main contributions are as follows: (i) We present fast approximation algorithms for computing the kernel distance between two point sets and that runs in near-linear time in the size of (note that an explicit calculation would take quadratic time). (ii) We present polynomial-time algorithms for approximately minimizing the kernel distance under rigid transformation; they run in time . (iii) We provide several general techniques for reducing complex objects to convenient sparse representations (specifically to point sets or sets of points sets) which approximately preserve the kernel distance. In particular, this allows us to reduce problems of computing the kernel distance between various types of objects such as curves, surfaces, and distributions to computing the kernel distance between point sets.

We build these results from two core technical tools. The first is a lifting map that maps objects into a finite-dimensional Euclidean space while approximately preserving the kernel distance. We believe that the analysis of lifting maps is of independent interest; indeed, these methods are popular in machine learning [41, 39, 40] but (in the realm of kernels) have received less attention in algorithms. Our second technical tool is an theorem relating -samples of range spaces defined with kernels to standard -samples of range spaces on -valued functions. This gives a simpler algorithm than prior methods in learning theory that make use of the -fat shattering dimension, and yields smaller -samples.

## 2 Preliminaries

#### Definitions.

For the most general case of the kernel distance (that we will consider in this paper) we associate a unit vector and a weighting with every . Similarly we associate a unit vector and weighting with every . Then we write

 κ(P,Q)=∫p∈P∫q∈QK(p,q)⟨U(p),V(q)⟩dμ(p)dν(q), (2.1)

where is the Euclidean inner product. This becomes a distance , defined through (1.1).

When is a curve in we let be the tangent vector at and . When is a surface in we let be the normal vector at and . This can be generalized to higher order surfaces through the machinery of -forms and -vectors [45, 35].

When is an arbitrary probability measure222We avoid the use of the term ’probability distribution’ as this conflicts with the notion of a (Schwarz) distribution that itself plays an important role in the underlying theory. in , then all are identical unit vectors and is the probability of . For discrete probability measures, described by a point set, we replace the integral with a sum and can be used as a weight .

#### From distances to metrics.

When is a symmetric similarity function (i.e. ,   , and decreases as and become “less similar”) then (defined through (2.1) and (1.1)) is a distance function, but may not be a metric. However, when is positive definite, then this is sufficient for to not only be a metric333Technically this is not completely correct; there are a few special cases, as we will see, where it is a pseudometric [41]., but also for to be of negative type [16].

We say that a symmetric function is a symmetric positive definite kernel if for any nonzero function it satisfies The proof of being a metric follows by considering the reproducing kernel Hilbert space associated with such a  [5]. Moreover, can be expressed very compactly in this space. The lifting map associated with has the “reproducing property” . So by linearity of the inner product, can be used to retrieve using the induced norm of . Observe that this defines a norm for a shape.

#### Examples.

If is the “trivial” kernel, where and for , then the distance between any two sets (without multiplicity) is , where is the symmetric difference. In general for arbitrary probability measures, . If , the Euclidean dot product, then the resulting lifting map is the identity function, and the distance between two measures is the Euclidean distance between their means, which is a pseudometric but not a metric.

#### Gaussian properties.

To simplify much of the presentation of this paper we focus on the case where the kernel is the Gaussian kernel; that is . Our techniques carry over to more general kernels, although the specific bounds will depend on the kernel being used. We now encapsulate some useful properties of Gaussian kernels in the following lemmata. When approximating , the first allows us to ignore pairs of points further that apart, the second allows us to approximate the kernel on a grid.

If then .

###### Lemma 2.2 (Lipschitz).

For where , for points we have .

###### Proof.

The slope for is the function . is maximized when , which yields . Thus . And since translating by changes by at most , the lemma holds. ∎

### 2.1 Problem Transformations

In prior research employing the kernel distance, ad hoc discretizations are used to convert the input objects (whether they be distributions, point clouds, curves or surfaces) to weighted discrete point sets. This process introduces error in the distance computations that usually go unaccounted for. In this subsection, we provide algorithms and analysis for rigorously discretizing input objects with guarantees on the resulting error. These algorithms, as a side benefit, provide a formal error-preserving reduction from kernel distance computation on curves and surfaces to the corresponding computations on discrete point sets.

After this section, we will assume that all data sets considered are discrete point sets of size with weight functions and . The weights need not sum to (i.e. need not be probability measures), nor be the same for and ; we will set to denote the total measure. This implies (since ) that . All our algorithms will provide approximations of within additive error . Since without loss of generality we can always normalize so that , our algorithms all provide an additive error of . We also set to capture the normalized diameter of the data.

#### Reducing orientation to weights.

The kernel distance between two oriented curves or surfaces can be reduced to a set of distance computations on appropriately weighted point sets. We illustrate this in the case of surfaces in . The same construction will also work for curves in .

For each point we can decompose into three fixed orthogonal components such as the coordinate axes . Now

 κ(P,Q)= ∫p∈P∫q∈QK(p,q)⟨U(p),V(q)⟩dμ(p)dν(q) =∫p∈P∫q∈QK(p,q)3∑i=1(Ui(p)Vi(q))dμ(p)dν(q) = 3∑i=1∫p∈P∫q∈QK(p,q)(Ui(p)Vi(q))dμ(p)dν(q) =3∑i=1κ(Pi,Qi),

where each has measure . When the problem specifies as a unit vector in , this approach reduces to independent problems without unit vectors.

#### Reducing continuous to discrete.

We now present two simple techniques (gridding and sampling) to reduce a continuous to a discrete point set, incurring at most error.

We construct a grid (of size ) on a smooth shape , so no point444For distributions with decaying but infinite tails, we can truncate to ignore tails such that the integral of the ignored area is at most and proceed with this approach using instead of . is further than from a point . Let be all points in closer to than any other point in . Each point is assigned a weight . The correctness of this technique follows by Lemma 2.2.

Alternatively, we can sample points at random from . If we have not yet reduced the orientation information to weights, we can generate points each with weight . This works with probability at least by invoking a coreset technique summarized in Theorem 5.2.

For the remainder of the paper, we assume our input dataset is a weighted point set in of size .

## 3 Computing the Kernel Distance I: WSPDs

The well-separated pair decomposition (WSPD) [9, 21] is a standard data structure to approximately compute pairwise sums of distances in near-linear time. A consequence of Lemma 2.2 is that we can upper bound the error of estimating by a nearby pair . Putting these observations together yields (with some work) an approximation for the kernel distance. Since , the problem reduces to computing efficiently and with an error of at most .

Two sets and are said to be -separated [9] if . Let denote the set of all unordered pairs of elements formed by and . An -WSPD of a point set is a set of pairs such that

• for all ,

• for all ,

• disjointly , and

• and are -separated for all .

For a point set of size , we can construct an -WSPD of size in time  [21, 14].

We can use the WSPD construction to compute as follows. We first create an -WSPD of in time. Then for each pair we also store four sets , , , and . Let and be arbitrary elements, and let . By construction, approximates the distance between any pair of elements in with error at most .

In each pair , we can compute the weight of the edges from to :

 Wi=(∑p∈Ai,Pμ(p))(∑q∈Bi,Qν(q))+(∑q∈Ai,Qν(q))(∑p∈Bi,Pμ(p)).

We estimate the contribution of the edges in pair to as

 ∑(a,b)∈Ai,P×Bi,Qμ(a)ν(b)e−D2i/h+∑(a,b)∈Ai,Q×Bi,Pμ(b)ν(a)e−D2i/h=Wie−D2i/h.

Since has error at most for each pair of points, Lemma 2.2 bounds the error as at most .

In order to bound the total error to , we bound the error for each pair by since . By Lemma 2.1, if , then . So for any pair with , (for ) we can ignore, because they cannot have an effect on of more than , and thus cannot have error more than .

Since we can ignore pairs with , each pair will have error at most . We can set this equal to and solve for . This ensures that each pair with will have error less than , as desired.

###### Theorem 3.1.

By building and using an -WSPD, we can compute a value in time , such that

 ∣∣U−D2K(P,Q)∣∣≤εW2.

## 4 Computing the Kernel Distance II: Approximate Feature Maps

In this section, we describe (approximate) feature representations for shapes and distributions that reduce the kernel distance computation to an distance calculation in an RKHS, . This mapping immediately yields algorithms for a host of analysis problems on shapes and distributions, by simply applying Euclidean space algorithms to the resulting feature vectors.

The feature map allows us to translate the kernel distance (and norm) computations into operations in a RKHS that take time if has dimension , rather than the brute force time . Unfortunately, is in general infinite dimensional, including the case of the Gaussian kernel. Thus, we use dimensionality reduction to find an approximate mapping (where ) that approximates :

 ∣∣∣∑p∈P∑q∈QK(p,q)μ(p)ν(q)−∑p∈P∑q∈Q⟨~ϕ(p),~ϕ(q)⟩∣∣∣≤εW2.

The analysis in the existing literature on approximating feature space does not directly bound the dimension required for a specific error bound555Explicit matrix versions of the Johnson-Lindenstraus lemma [24] cannot be directly applied because the source space is itself infinite dimensional, rather than .. We derive bounds from two known techniques: random projections [36] (for shift-invariant kernels, includes Gaussians) and the Fast Gauss Transform [48, 20] (for Gaussian kernel). We produce three different features maps, with different bounds on the number of dimensions depending on ( is the number of points), (the error), (the probability of failure), (the normalized diameter of the points), and/or (the ambient dimension of the data before the map).

### 4.1 Random Projections Feature Space

Rahimi and Recht [36] proposed a feature mapping that essentially applies an implicit Johnson-Lindenstrauss projection from . The approach works for any shift invariant kernel (i.e one that can be written as ). For the Gaussian kernel, , where . Let the Fourier transform of is . A basic result in harmonic analysis [37] is that is a kernel if and only if is a measure (and after scaling, is a probability distribution). Let be drawn randomly from the distribution defined by :

 k(x−y)=∫ω∈\mdmathbbRdg(ω)eι⟨ω,x−y⟩dω=E[⟨(x),(y)⟩],

where are the real and imaginary components of . This implies that is an unbiased estimator of .

We now consider a -dimensional feature vector where the th and th coordinates are described by for some drawn randomly from . Next we prove a bound on using this construction.

###### Lemma 4.1.

For with , with probability

 ∣∣∣∑p∈P∑q∈QK(p,q)μ(p)ν(q)−∑p∈P∑q∈Q⟨(p),(q)⟩∣∣∣≤εW2.
###### Proof.

We make use of the following Chernoff-Hoeffding bound. Given a set of independent random variables, such that , then for we can bound . We can now bound the error of using for any pair as follows:

 Pr[|⟨(p),(q)⟩−μ(p)ν(q)k(p−q)|≥εμ(p)ν(q)] = ≤ Pr[∣∣ ∣∣∑i2ρμ(p)ν(q)⟨i(p),i(q)⟩−E[∑i2ρμ(p)ν(q)⟨i(p),i(q)⟩]∣∣ ∣∣≥εμ(p)ν(q)] ≤ 2e−2(εμ(p)ν(q))2/(ρ2/2)≤2e−ρ2/64,

where the last inequality follows by since for each pair of coordinates for all (or ). By the union bound, the probability that this holds for all pairs of points is given by

 Pr[∀(p,q)∈P×Q|⟨(p),(q)⟩−μ(p)ν(q)k(p−q)|≥εμ(p)ν(q)]≤(n2)2e−ρ2/64.

Setting and solving for yields that for , with probability at least , for all we have . It follows that with probability at least

 ∣∣∣∑p∈P∑q∈Qμ(p)ν(q)K(p,q)−∑p∈P∑q∈Q⟨(p),(q)⟩∣∣∣≤ε∑p∈P∑q∈Qμ(p)ν(q)≤εW2.\qed

Note that the analysis of Rahimi and Recht [36] is done for unweighted point sets (i.e. ) and actually goes further, in that it yields a guarantee for any pair of points taken from a manifold having diameter . They do this by building an -net over the domain and applying the above tail bounds to the -net. We can adapt this trick to replace the term in by a term, recalling . This leads to the same guarantees as above with a dimension of .

### 4.2 Fast Gauss Transform Feature Space

The above approach works by constructing features in the frequency domain. In what follows, we present an alternative approach that operates in the spatial domain directly. We base our analysis on the Improved Fast Gauss Transform (IFGT) [48], an improvement on the Fast Gauss Transform. We start with a brief review of the IFGT (see the original work [48] for full details).

#### IFGT feature space construction.

The goal of the IFGT is to approximate . First we rewrite as the summation where . Next, we approximate in two steps. First we rewrite

 G(q)=ν(q)∑p∈Pμ(p)e−∥q−x∗∥2h2e−∥p−x∗∥2h2e2∥q−x∗∥⋅∥p−x∗∥h2,

where the quantity is a fixed vector that is usually the centroid of . The first two exponential terms can be computed for each and once. Second, we approximate the remaining exponential term by its Taylor expansion . After a series of algebraic manipulations, the following expression emerges:

 G(q)=ν(q)e−∥q−x∗∥2h2∑α≥0C(q−x∗h)

where is given by

 C=2|α|α!∑p∈Pμ(p)e−∥p−x∗∥2h2(p−x∗h).

The parameter is a multiindex, and is actually a vector of dimension . The expression , for , denotes the monomial , the quantity is the total degree , and the quantity . The multiindices are sorted in graded lexicographic order, which means that comes before if , and two multiindices of the same degree are ordered lexicographically.

The above expression for is an exact infinite sum, and is approximated by truncating the summation at multiindices of total degree . Note that there are at most such multiindices. We now construct a mapping . Let Then

 G(q)=∑~ϕ(q)∑p∈P~ϕ(p)

and is then given by

 S=∑p∈P∑q∈Q∑~ϕ(q)~ϕ(p)=∑p∈P∑q∈Q⟨~ϕ(q),~ϕ(p)⟩.

#### IFGT error analysis.

The error incurred by truncating the sum at degree is given by

 Err(τ)=∣∣∑p∈P∑q∈QK(p,q)μ(p)ν(q)−∑p∈P∑q∈Q⟨~ϕ(p),~ϕ(q)⟩∣∣≤∑p∈P∑q∈Qμ(p)ν(q)2τ!2τ=W22τ!2τ.

Set . Applying Stirling’s approximation, we solve for in . This yields the bounds and . Thus our error bound holds for . Using , we obtain the following result.

###### Lemma 4.2.

There exists a mapping with so

 ∣∣∣∑p∈P∑q∈QK(p,q)μ(p)ν(q)−∑p∈P∑q∈Q⟨~ϕ(p),~ϕ(q)⟩∣∣∣≤εW2.

### 4.3 Summary of Feature Maps

We have developed three different bounds on the dimension required for feature maps that approximate to within .

• . Lemma 4.2. Advantages: deterministic, independent of , logarithmic dependence on . Disadvantages: polynomial dependence on , exponential dependence on .

• . Lemma 4.1. Advantages: independent of and . Disadvantages: randomized, dependent on , polynomial dependence on .

• . (above) Advantages: independent of , logarithmic dependence on , polynomial dependence on . Disadvantages: randomized, dependence on and , polynomial dependence on .

For simplicity, we (mainly) use the Random-points based result from Lemma 4.1 in what follows. If appropriate in a particular application, the other bounds may be employed.

#### Feature-based computation of Dk.

As before, we can decompose and use Lemma 4.1 to approximate each of , and with error .

###### Theorem 4.1.

We can compute a value in time such that , with probability at least .

#### A nearest-neighbor algorithm.

The feature map does more than yield efficient algorithms for the kernel distance. As a representation for shapes and distributions, it allows us to solve other data analysis problems on shape spaces using off-the-shelf methods that apply to points in Euclidean space. As a simple example of this, we can combine the Random-points feature map with known results on approximate nearest-neighbor search in Euclidean space [3] to obtain the following result.

###### Lemma 4.3.

Given a collection of point sets , and a query surface , we can compute the -approximate nearest neighbor to in under the kernel distance in time query time using space and preprocessing.

## 5 Coresets for the Kernel Distance

The kernel norm (and distance) can be approximated in near-linear time; however, this may be excessive for large data sets. Rather, we extract a small subset (a coreset) from the input such that the kernel distance between and is small. By triangle inequality, can be used as a proxy for . Specifically, we extend the notion of -samples for range spaces to handle non-binary range spaces defined by kernels.

#### Background on range spaces.

Let denote the total weight of a set of points , or cardinality if no weights are assigned. Let be a set of points and let be a family of subsets of . For examples of , let denote the set of all subsets defined by containment in a ball and let denote the set of all subsets defined by containment in ellipses. We say is a range space. Let . An -sample (or -approximation) of is a subset such that

 maxA∈A∣∣¯ξQ(Q∩A)−¯ξP(P∩A)∣∣≤ε.

To create a coreset for the kernel norm, we want to generalize these notions of -samples to non-binary (-valued instead of -valued) functions, specifically to kernels. For two point sets , define , and when we have a singleton set and a subset then we write . Let be the maximum value a kernel can take on a dataset , which can be normalized to . We say a subset of is an -sample of if

 maxq|¯κP(P,q)−¯κS(S,q)|≤εK+.

The standard notion of VC-dimenion [47] (and related notion of shattering dimension) is fundamentally tied to the binary (-valued) nature of ranges, and as such, it does not directly apply to -samples of . Other researchers have defined different combinatorial dimensions that can be applied to kernels [15, 25, 1, 46]. The best result is based on -fat shattering dimension f [25], defined for a family of -valued functions and a ground set . A set is -fat shattered by if there exists a function such that for all subsets there exists some such that for every and for every . Then for the largest cardinality set that can be -fat shattered. Bartlett et al. [7] show that a random sample of elements creates an -sample (with probability at least ) with respect to for . Note that the -fat shattering dimension of Gaussian and other symmetric kernels in is (by setting for all ), the same as balls in , so this gives a random-sample construction for -samples of of size .

In this paper, we improve this result in two ways by directly relating a kernel range space to a similar (binary) range space . First, this improves the random-sample bound because it uses sample-complexity results for binary range spaces that have been heavily optimized. Second, this allows for all deterministic -sample constructions (which have no probability of failure) and can have much smaller size.

#### Constructions for -samples.

Vapnik and Chervonenkis [47] showed that the complexity of -samples is tied to the VC-dimension of the range space. That is, given a range space a subset is said to be shattered by if all subsets of can be realized as for . Then the VC-dimension of a range space is the cardinality of the largest subset that can be shattered by . Vapnik and Chervonenkis [47] showed that if the VC-dimension of a range space is , then a random sample of points from is an -sample with probability at least . This bound was improved to by Talagrand [44, 26].

Alternatively, Matousek [28] showed that -samples of size could be constructed deterministically, that is there is no probability of failure. A simpler algorithm with more thorough runtime analysis is presented in Chazelle and Matousek [12], which runs in time. Smaller -samples exist; in particular Matousek, Welzl, and Wernisch [31] and improved by Matousek [29] show that -samples exist of size , based on a discrepancy result that says there exists a labeling such that . It is alluded by Chazelle [11] that if an efficient construction for such a labeling existed, then an algorithm for creating -samples of size would follow, see also Phillips [34]. Recently, Bansal [6] provided a randomized polynomial algorithm for the entropy method, which is central in proving these existential coloring bounds. This leads to an algorithm that runs in time , as claimed by Charikar et al. [10, 33].

An alternate approach is through the VC-dimension of the dual range space , of (primal) range space , where is the set of subsets of ranges defined by containing the same element of . In our context, for range spaces defined by balls and ellipses of fixed orientation their dual range spaces have VC-dimension . Matousek [30] shows that a technique of matching with low-crossing number [13] along with Haussler’s packing lemma [22] can be used to construct a low discrepancy coloring for range spaces where , the VC-dimension of the dual range space is bounded. This technique can be made deterministic and runs in time. Invoking this technique in the Chazelle and Matousek [12] framework yields an -sample of size in time. Specifically, we attain the following result:

###### Lemma 5.1.

For discrete ranges spaces and for of size , we can construct an -sample of size in time.

For specific range spaces, the size of -samples can be improved beyond the VC-dimension bound. Phillips [34] showed for ranges consisting of axis-aligned boxes in , that -samples can be created of size . This can be generalized to ranges defined by predefined normal directions of size . These algorithms run in time . And for intervals over , -samples of can be created of size by sorting points and retaining every th point in the sorted order [27].

#### -Samples for kernels.

The super-level set of a kernel given one input and a value , is the set of all points such that . We say that a kernel is linked to a range space if for every possible input point and any value that the super-level set of defined by is equal to some . For instance multi-variate Gaussian kernels with no skew are linked to since all super-level sets are balls, and multi-variate Gaussian kernels with non-trivial covariance are linked to since all super-level sets are ellipses.

###### Theorem 5.1.

For any kernel linked to a range space , an -sample of for is a -sample of .

A (flawed) attempt at a proof may proceed by considering a series of approximate level-sets, within which each point has about the same function value. Since