###### Abstract

Information distances like the Hellinger distance and the Jensen-Shannon divergence have deep roots in information theory and machine learning. They are used extensively in data analysis especially when the objects being compared are high dimensional empirical probability distributions built from data. However, we lack common tools needed to actually use information distances in applications efficiently and at scale with any kind of provable guarantees. We can’t sketch these distances easily, or embed them in better behaved spaces, or even reduce the dimensionality of the space while maintaining the probability structure of the data.

In this paper, we build these tools for information distances: both for the Hellinger distance and the Jensen–Shannon divergence, as well as related measures such as the $\chi^2$ divergence. We first show that they can be sketched efficiently (i.e., up to multiplicative error in sublinear space) in the aggregate streaming model. This result is exponentially stronger than known upper bounds for sketching these distances in the strict turnstile streaming model. Second, we show a finite-dimensional embedding result for the Jensen–Shannon and $\chi^2$ divergences that preserves pairwise distances. Finally, we prove a dimensionality reduction result for the Hellinger, Jensen–Shannon, and $\chi^2$ divergences that preserves the information geometry of the distributions (specifically, by retaining the simplex structure of the space). While our second result above already implies that these divergences can be explicitly embedded in Euclidean space, retaining the simplex structure is important because it allows us to continue doing inference in the reduced space. In essence, we preserve not just the distance structure but the underlying geometry of the space.

# Sketching, Embedding, and Dimensionality Reduction for Information Spaces

Amirali Abdullah (amirali@cs.utah.edu)
mcgregor@cs.umass.edu
suresh@cs.utah.edu

This research was funded in part by the NSF under grants CCF-0953066, CCF-0953754, CCF-1320719 and BIGDATA-1251049 and by a Google Faculty Research Award.

## 1 Introduction

The space of information distances includes many distances that are used extensively in data analysis. These include the well-known Bregman divergences, the $\alpha$-divergences, and the $f$-divergences. In this work we focus on a subclass of the $f$-divergences that admit embeddings into some (possibly infinite-dimensional) Hilbert space, with a specific emphasis on the JS divergence. These divergences are used in statistical tests and estimators (Beran, 1977), as well as in image analysis (Peter and Rangarajan, 2008), computer vision (Huang et al., 2005; Mahmoudi and Sapiro, 2009), and text analysis (Dhillon et al., 2003; Eiron and McCurley, 2003). The $f$-divergences were introduced by Csiszár (1967), and, in the most general case, also include measures such as the Hellinger, JS, and $\chi^2$ divergences (here we consider a symmetrized variant of the $\chi^2$ distance).

To work with the geometry of these divergences effectively at scale and in high dimensions, we need algorithmic tools that can provide provably high quality approximate representations of the geometry. The techniques of sketching, embedding, and dimensionality reduction have evolved as ways of dealing with this problem.

A sketch for a set of points with respect to a property $P$ is a function that maps the data to a small summary from which $P$ can be evaluated, albeit with some approximation error. Linear sketches are especially useful for estimating a derived property of a data stream in a fast and compact way. (Indeed, Li, Nguyen, and Woodruff (2014) show that any optimal one-pass streaming sketch algorithm in the turnstile model can be reduced to a linear sketch with logarithmic space overhead.) Complementing sketching, embedding techniques are one-to-one mappings that transform a collection of points lying in one space $X$ to another (presumably easier) space $Y$, while approximately preserving distances between points. Dimensionality reduction is a special kind of embedding which preserves the structure of the space while reducing its dimension. These embedding techniques can be used in an almost "plug-and-play" fashion to speed up many algorithms in data analysis, for example near neighbor search (and classification), clustering, and closest pair calculations.

Unfortunately, while these tools have been well developed for norms like $\ell_1$ and $\ell_2$, we lack such tools for information distances. This is not just a theoretical concern: information distances are semantically more suited to many tasks in machine learning, and building the appropriate algorithmic toolkit to manipulate them efficiently would greatly expand the settings in which they can be used.

### 1.1 Our contributions

#### Sketching information divergences.

Guha, Indyk, and McGregor (2007) proved an impossibility result, showing that a large class of information divergences cannot be sketched in sublinear space, even if we allow for constant factor approximations. This result holds in the strict turnstile streaming model, in which coordinates of two points $x$, $y$ are increased incrementally and we wish to maintain an estimate of the divergence between them. They left open the question of whether these divergences can be sketched in the aggregate streaming model, where each element of the stream gives the $i$th coordinate of $x$ or $y$ in its entirety, but the coordinates may appear in an arbitrary order. We answer this in the affirmative for two important information distances, namely, the Jensen–Shannon and $\chi^2$ divergences.

###### Theorem 1

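A set of points under the Jensen–Shannon (JS) or $\chi^2$ divergence can be deterministically embedded into $O\left(\frac{d^2}{\varepsilon}\log\frac{d}{\varepsilon}\right)$ dimensions under $\ell_2^2$ with $\varepsilon$ additive error. The same space bound holds when sketching JS or $\chi^2$ in the aggregate stream model.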

###### Corollary 2

Assuming polynomial precision, an AMS sketch for Euclidean distance can reduce the dimension to for a multiplicative approximation in the aggregate stream setting.

###### Theorem 3

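A set of $n$ points under the JS or $\chi^2$ divergence can be embedded into $\ell_2^{\bar d}$ with $\bar d = O(n^2 d^3 \varepsilon^{-2})$ with $(1+\varepsilon)$ multiplicative error.

For both techniques, applying the Euclidean JL-Lemma can further reduce the dimension to $O(\log n / \varepsilon^2)$ in the offline setting.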

#### Dimensionality reduction.

We then turn to the more challenging case of performing dimensionality reduction for information distances, where we wish to preserve not only the distances between pairs of points (distributions), but also the underlying simplicial structure of the space, so that we can continue to interpret coordinates in the new space as probabilities. This notion of a structure-preserving dimensionality reduction is implicit when dealing with normed spaces (since we always map a normed space to another), but requires an explicit mapping when dealing with more structured spaces. We prove an analog of the classical JL–Lemma :

###### Theorem 4

For the Jensen–Shannon, Hellinger, and $\chi^2$ divergences, there exists a structure preserving dimensionality reduction from the high dimensional simplex $\Delta_d$ to a low dimensional simplex $\Delta_k$, where $k = O(\log n / \varepsilon^2)$.

The theorem extends to "well-behaved" $f$-divergences (see Section 3 for a precise definition). Moreover, the dimensionality reduction is constructive for any divergence with a finite dimensional kernel (such as the Hellinger divergence), or an infinite dimensional kernel that can be sketched in finite space, as we show is feasible for the JS and $\chi^2$ divergences.

#### Our techniques.

The unifying approach of our three results (sketching, embedding into $\ell_2^2$, and dimensionality reduction) is to analyze carefully the infinite dimensional kernel of the information divergences. Quantizing and truncating the kernel yields the sketching result, while sampling repeatedly from it produces an embedding into $\ell_2^2$. Finally, given such an embedding, we show how to perform dimensionality reduction by proving that each of the divergences admits a region of the simplex where it is similar to $\ell_2^2$. To the best of our knowledge, this is the first result that explicitly uses the kernel representation of these information distances to build approximate geometric structures; while the existence of a kernel for the Jensen–Shannon distance was well known, this structure had never been exploited for algorithmic advantage.

## 2 Related Work

The works by Fuglede and Topsøe (2004), and then by Vedaldi and Zisserman (2012) study embeddings of information divergences into an infinite dimensional Hilbert space by representing them as an integral along a one-dimensional curve in . Vedaldi and Zisserman give an explicit formulation of this kernel for JS and divergences, for which a discretization (by quantizing and truncating) yields an additive error embedding into a finite dimensional . However, they do not obtain quantitative bounds on the dimension of target space needed or address the question of multiplicative approximation guarantees.

In the realm of sketches, Guha, Indyk, and McGregor (2007) show $\Omega(n)$ space (where $n$ is the length of the stream) is required in the strict turnstile model even for a constant factor multiplicative approximation. These bounds hold for a wide range of information divergences, including JS, Hellinger and the $\chi^2$ divergences. They show however that an additive error of can be achieved using space. In contrast, one can indeed achieve a multiplicative approximation in the aggregate streaming model for information divergences that have a finite dimensional embedding into $\ell_2^2$. For instance, Guha et al. (2006) observe that for the Hellinger distance, which has a trivial such embedding, sketching is equivalent to sketching $\ell_2^2$ and hence may be done up to a $(1+\varepsilon)$-multiplicative approximation in space. This immediately implies a constant factor approximation of the JS and $\chi^2$ divergences in the same space, but no bounds were known prior to our work for a $(1+\varepsilon)$-sketching result for the JS and $\chi^2$ divergences in any streaming model.
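The Hellinger observation above can be illustrated with a small sketch. It assumes the aggregate stream delivers items `(name, i, value)` in arbitrary order, maps each value through the trivial feature map $\sqrt{\cdot}$, and feeds it to an AMS-style random-sign sketch. The function names, the stream encoding, and the use of a seeded PRNG in place of limited-independence hash functions are all illustrative assumptions, not the construction of Guha et al.

```python
import math
import random


def make_signs(k, d, seed=0):
    """k x d table of +/-1 signs shared by both sketches.

    Assumption: a seeded PRNG stands in for the limited-independence
    hash functions a real streaming implementation would use.
    """
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(k)]


def sketch(stream, name, signs):
    """Linear sketch of (sqrt(p_i))_i built from an aggregate stream.

    Each item is (name, i, value): coordinate i of one distribution,
    delivered in full, in arbitrary order.
    """
    s = [0.0] * len(signs)
    for item_name, i, value in stream:
        if item_name != name:
            continue
        root = math.sqrt(value)  # trivial feature map for Hellinger
        for r, row in enumerate(signs):
            s[r] += row[i] * root
    return s


def estimate(sp, sq):
    """Mean squared difference of two sketches (estimates the Hellinger sum)."""
    return sum((a - b) ** 2 for a, b in zip(sp, sq)) / len(sp)


def hellinger(p, q):
    """Exact value sum_i (sqrt(p_i) - sqrt(q_i))^2, for comparison."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def demo(d=20, k=400, seed=1):
    """Return (sketched estimate, exact value) on a random pair of distributions."""
    rng = random.Random(seed)
    p = [rng.random() for _ in range(d)]
    q = [rng.random() for _ in range(d)]
    p_total, q_total = sum(p), sum(q)
    p = [a / p_total for a in p]
    q = [b / q_total for b in q]
    stream = [("p", i, p[i]) for i in range(d)] + [("q", i, q[i]) for i in range(d)]
    rng.shuffle(stream)  # coordinates may arrive in any order
    signs = make_signs(k, d)
    est = estimate(sketch(stream, "p", signs), sketch(stream, "q", signs))
    return est, hellinger(p, q)
```

Because the sketch is linear in the transformed coordinates, the arrival order does not matter; the number of rows `k` trades space against the variance of the estimate.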

Moving onto dimensionality reduction from simplex to simplex, in the only other work we are aware of, Kyng, Phillips, and Venkatasubramanian (2010) show a limited dimensionality reduction result for the Hellinger distance. Their approach works by showing that if the input points lie in a specific region of the simplex, then a standard random projection will keep the points on a lower-dimensional simplex while preserving the distances approximately. Unfortunately, this region is a small ball centered in the interior of the simplex, which further shrinks with the dimension. This is in sharp contrast to our work here, where the input points are unconstrained.

While it does not admit a kernel, the $\ell_1$ distance is also an $f$-divergence, and it is therefore natural to investigate its potential connection with the measures we study here. For $\ell_1$, it is well known that significant dimensionality reduction is not possible: an embedding with distortion $1+\varepsilon$ requires the points to be embedded in $n^{1-O(1/\log(1/\varepsilon))}$ dimensions, which is nearly linear. This result was proved (and strengthened) in a series of results (Andoni et al., 2011; Regev, 2012; Lee and Naor, 2004; Brinkman and Charikar, 2005).

The general literature on sketching and embeddability in normed spaces is too extensive to be reviewed here: we point the reader to Andoni et al. (2014) for a full discussion of results in this area. One of the most famous applications of dimension reduction is the Johnson–Lindenstrauss (JL) Lemma, which states that any set of $n$ points in $\ell_2$ can be embedded into $O(\log n / \varepsilon^2)$ dimensions in the same space while preserving pairwise distances to within $(1 \pm \varepsilon)$. This result has become a core step in algorithms for near neighbor search (Ailon and Chazelle, 2006; Andoni and Indyk, 2006), speeding up clustering algorithms (Boutsidis et al., 2015), and efficient approximation of matrices (Clarkson and Woodruff, 2013), among many others.

Although sketching, embeddability, and dimensionality reduction are related operations, they are not always equivalent. For example, even though $\ell_1$ and $\ell_2$ have very different behavior under dimensionality reduction, they can both be sketched to an arbitrary error in the turnstile model (and in fact any $\ell_p$ norm, $p \le 2$, can be sketched using $p$-stable distributions (Indyk, 2000)). In the offline setting, Andoni et al. (2014) show that sketching and embedding of normed spaces are equivalent: for any finite-dimensional normed space $X$, a constant distortion and space sketching algorithm for $X$ exists if and only if there exists a linear embedding of $X$ into $\ell_{1-\varepsilon}$.

## 3 Background

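In this section, we define precisely the class of information divergences that we work with, and the specific properties that allow us to obtain sketching, embedding, and dimensionality reduction results. In what follows, $\Delta_d$ denotes the $d$-simplex: $\Delta_d = \{(x_1, \ldots, x_d) \mid \sum_i x_i = 1 \text{ and } x_i \ge 0 \ \forall i\}$. Let $[d] = \{1, \ldots, d\}$.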

###### Definition 5 (f-divergence)

Let $p$ and $q$ be two distributions on $[d]$. A convex function $f$ such that $f(1) = 0$ gives rise to an $f$-divergence $D_f$ as:

$$D_f(p,q) = \sum_{i=1}^{d} p_i \cdot f\!\left(\frac{q_i}{p_i}\right),$$

where we define , , and .

###### Definition 6 (Regular distance)

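We call a distance function $D : X \times X \to \mathbb{R}$ regular if there exists a feature map $\phi$ from $X$ into a (possibly infinite dimensional) Hilbert space such that:

$$D(x,y) = \|\phi(x) - \phi(y)\|^2 \quad \forall x, y \in X.$$

The work of Fuglede and Topsøe (2004) establishes that JS is regular; Vedaldi and Zisserman (2012) construct an explicit feature map for the JS kernel in terms of a function $\Psi_x(\omega)$ given by

$$\Psi_x(\omega) = \exp(i\omega \ln x)\sqrt{\frac{2x\,\operatorname{sech}(\pi\omega)}{(\ln 4)(1 + 4\omega^2)}}.$$

Hence, for scalars $x, y$, we have $\mathrm{JS}(x,y) = \int_{-\infty}^{+\infty} |\Psi_x(\omega) - \Psi_y(\omega)|^2\, d\omega$. The "embedding" for a given distribution $p$ is then the concatenation of the functions $\Psi_{p_i}$, i.e., $(\Psi_{p_1}, \ldots, \Psi_{p_d})$.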

###### Definition 7 (Well-behaved divergence)

A well-behaved $f$-divergence is a regular $f$-divergence such that $f(1) = 0$, $f'(1) = 0$, $f''(1) > 0$, and $f'''(1)$ exists.

In this paper, we will focus on the following well-behaved $f$-divergences.

###### Definition 8

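The Jensen–Shannon (JS), Hellinger, and $\chi^2$ divergences between distributions $p$ and $q$ are defined as:

$$
\begin{aligned}
\mathrm{JS}(p,q) &= \sum_i p_i \log\frac{2p_i}{p_i + q_i} + q_i \log\frac{2q_i}{p_i + q_i}, \\
\mathrm{He}(p,q) &= \sum_i \left(\sqrt{p_i} - \sqrt{q_i}\right)^2, \\
\chi^2(p,q) &= \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}.
\end{aligned}
$$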
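As a concrete reading of Definitions 5 and 8, the following sketch evaluates the three divergences directly and again through the generic form $\sum_i p_i f(q_i/p_i)$. The generator functions `F_JS`, `F_HE`, and `F_CHI2` are derived here by matching terms and are not stated in the text; logarithms are assumed to be base 2, and the generic routine assumes strictly positive coordinates so that the boundary conventions of Definition 5 are not needed.

```python
import math


def js(p, q):
    """Jensen-Shannon divergence as in Definition 8 (base-2 logs assumed)."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            total += a * math.log2(2 * a / (a + b))
        if b > 0:
            total += b * math.log2(2 * b / (a + b))
    return total


def hellinger(p, q):
    """Hellinger divergence: sum of squared differences of square roots."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def chi2(p, q):
    """Symmetrized chi-squared divergence."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def f_divergence(f, p, q):
    """Generic D_f(p, q) = sum_i p_i f(q_i / p_i); assumes all p_i, q_i > 0."""
    return sum(a * f(b / a) for a, b in zip(p, q))


# Illustrative generators (each satisfies f(1) = 0).
def F_JS(t):
    return math.log2(2 / (1 + t)) + t * math.log2(2 * t / (1 + t))


def F_HE(t):
    return (math.sqrt(t) - 1) ** 2


def F_CHI2(t):
    return (t - 1) ** 2 / (t + 1)
```

For example, with `p = [0.5, 0.3, 0.2]` and `q = [0.2, 0.2, 0.6]`, `f_divergence(F_JS, p, q)` and `js(p, q)` agree up to rounding, and likewise for the other two pairs.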

## 4 Embedding JS into $\ell_2^2$

We present two algorithms for embedding JS into $\ell_2^2$. The first is deterministic and gives an additive error approximation, whereas the second is randomized but yields a multiplicative approximation in an offline setting. The advantage of the first algorithm is that it can be realized in the streaming model and, under the standard assumption of polynomial precision in the streaming input, it yields a $(1+\varepsilon)$-multiplicative approximation in this setting as well.

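We derive some terms in the kernel representation of $\mathrm{JS}(x,y)$ which we will find convenient. First, the explicit formulation in Section 3 yields that for $x, y \in [0,1]$:

$$
\begin{aligned}
\mathrm{JS}(x,y) &= \int_{-\infty}^{+\infty} \left\| e^{i\omega\ln x}\sqrt{\frac{2x\,\operatorname{sech}(\pi\omega)}{(\ln 4)(1+4\omega^2)}} - e^{i\omega\ln y}\sqrt{\frac{2y\,\operatorname{sech}(\pi\omega)}{(\ln 4)(1+4\omega^2)}} \right\|^2 d\omega \\
&= \int_{-\infty}^{+\infty} \left(\frac{2\operatorname{sech}(\pi\omega)}{(\ln 4)(1+4\omega^2)}\right) \left\| \sqrt{x}\,e^{i\omega\ln x} - \sqrt{y}\,e^{i\omega\ln y} \right\|^2 d\omega.
\end{aligned}
$$

For convenience, we now define:

$$
\begin{aligned}
h(x,y,\omega) &= \left\| \sqrt{x}\,e^{i\omega\ln x} - \sqrt{y}\,e^{i\omega\ln y} \right\|^2 \\
&= \left(\sqrt{x}\cos(\omega\ln x) - \sqrt{y}\cos(\omega\ln y)\right)^2 + \left(\sqrt{x}\sin(\omega\ln x) - \sqrt{y}\sin(\omega\ln y)\right)^2,
\end{aligned}
$$

and

$$\kappa(\omega) = \frac{2\operatorname{sech}(\pi\omega)}{(\ln 4)(1+4\omega^2)}.$$

We can then write $\mathrm{JS}(p,q) = \sum_{i=1}^{d} f_J(p_i, q_i)$ where

$$f_J(x,y) = \int_{-\infty}^{\infty} h(x,y,\omega)\,\kappa(\omega)\,d\omega = x\log\left(\frac{2x}{x+y}\right) + y\log\left(\frac{2y}{x+y}\right).$$

It is easy to verify that $\kappa(\omega)$ is a distribution, i.e., $\int_{-\infty}^{\infty} \kappa(\omega)\,d\omega = 1$.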
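Both identities above can be checked numerically. The sketch below uses a plain trapezoid rule on $[-20, 20]$; the grid, the truncation point, and the function names are illustrative choices, and the logarithm in $f_J$ is taken base 2, the base under which $f_J(x, 0) = x$ is consistent with $\kappa$ integrating to 1.

```python
import math


def kappa(w):
    """Kernel density kappa(w) = 2 sech(pi w) / ((ln 4)(1 + 4 w^2))."""
    return 2.0 / (math.cosh(math.pi * w) * math.log(4.0) * (1.0 + 4.0 * w * w))


def h(x, y, w):
    """Squared modulus of sqrt(x) e^{i w ln x} - sqrt(y) e^{i w ln y}; needs x, y > 0."""
    re = math.sqrt(x) * math.cos(w * math.log(x)) - math.sqrt(y) * math.cos(w * math.log(y))
    im = math.sqrt(x) * math.sin(w * math.log(x)) - math.sqrt(y) * math.sin(w * math.log(y))
    return re * re + im * im


def f_js(x, y):
    """Closed form of the one-dimensional JS term (base-2 logs assumed)."""
    return x * math.log2(2 * x / (x + y)) + y * math.log2(2 * y / (x + y))


def integrate(g, lo=-20.0, hi=20.0, n=40000):
    """Composite trapezoid rule; the tails beyond |w| = 20 are negligible here."""
    step = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for k in range(1, n):
        total += g(lo + k * step)
    return total * step


mass = integrate(kappa)                                   # should be close to 1
lhs = integrate(lambda w: h(0.3, 0.1, w) * kappa(w))      # integral form of f_J
rhs = f_js(0.3, 0.1)                                      # closed form of f_J
```

Here `mass` approximates $\int \kappa$, while `lhs` and `rhs` are the two sides of the identity for $f_J(0.3, 0.1)$.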

### 4.1 Deterministic embedding

We will produce an embedding $\phi_p = (\phi_{p_1}, \ldots, \phi_{p_d})$, where each $\phi_{p_i}$ is an integral that we can discretize by quantizing and truncating carefully.

To analyze Algorithm 1, we first obtain bounds on the function $h$ and its derivative.

###### Lemma 9

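For $x, y \in [0,1]$, we have $0 \le h(x,y,\omega) \le 2$ and $\left|\frac{\partial h(x,y,\omega)}{\partial \omega}\right| \le 16$.

Proof  Clearly $h(x,y,\omega) \ge 0$. Furthermore, since $x, y \in [0,1]$, we have

$$h(x,y,\omega) \le \left|\sqrt{x}\,e^{i\omega\ln x}\right|^2 + \left|\sqrt{y}\,e^{i\omega\ln y}\right|^2 = x + y \le 2.$$

Next,

$$
\begin{aligned}
\left|\frac{\partial h(x,y,\omega)}{\partial \omega}\right|
&= \Big| 2\left(\sqrt{x}\cos(\omega\ln x) - \sqrt{y}\cos(\omega\ln y)\right)\left(-\sqrt{x}\sin(\omega\ln x)\ln x + \sqrt{y}\sin(\omega\ln y)\ln y\right) \\
&\qquad + 2\left(\sqrt{x}\sin(\omega\ln x) - \sqrt{y}\sin(\omega\ln y)\right)\left(\sqrt{x}\cos(\omega\ln x)\ln x - \sqrt{y}\cos(\omega\ln y)\ln y\right) \Big| \\
&\le \left| 2\left(\sqrt{x} + \sqrt{y}\right)\left(\sqrt{x}\ln x + \sqrt{y}\ln y\right) \right| + 2\left| \left(\sqrt{x} + \sqrt{y}\right)\left(\sqrt{x}\ln x + \sqrt{y}\ln y\right) \right| \le 16,
\end{aligned}
$$

where the last inequality follows since $|\sqrt{z}\ln z| \le 1$ for $z \in [0,1]$.

The next two steps approximate the infinite-dimensional continuous representation by a finite-dimensional discrete representation by appropriately truncating and quantizing the integral.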

###### Lemma 10 (Truncation)

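For $t \ge \ln(4/\varepsilon)$,

$$f_J(x,y) \ge \int_{-t}^{t} h(x,y,\omega)\,\kappa(\omega)\,d\omega \ge f_J(x,y) - \varepsilon.$$

Proof  The first inequality follows since $h(x,y,\omega) \ge 0$. For the second inequality, we use $h(x,y,\omega) \le 2$:

$$\int_{-\infty}^{-t} h(x,y,\omega)\,\kappa(\omega)\,d\omega + \int_{t}^{\infty} h(x,y,\omega)\,\kappa(\omega)\,d\omega \le 4\int_{t}^{\infty} \kappa(\omega)\,d\omega < 4\int_{t}^{\infty} \frac{4e^{-\pi\omega}}{\ln 4}\,d\omega < 4e^{-t} \le \varepsilon,$$

where the last inequality follows if $t \ge \ln(4/\varepsilon)$.

Define $\omega_i = \varepsilon i/16$ for $i \in \mathbb{Z}$ and $\tilde{h}(x,y,\omega) = h(x,y,\omega_i)$ for $\omega \in [\omega_i, \omega_{i+1})$.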

###### Lemma 11 (Quantization)

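For any $a, b$,

$$\int_{a}^{b} h(x,y,\omega)\,\kappa(\omega)\,d\omega = \int_{a}^{b} \tilde{h}(x,y,\omega)\,\kappa(\omega)\,d\omega \pm \varepsilon.$$

Proof  First note that

$$\left|\tilde{h}(x,y,\omega) - h(x,y,\omega)\right| \le \left(\frac{\varepsilon}{16}\right) \cdot \max_{x,y\in[0,1],\,\omega} \left|\frac{\partial h(x,y,\omega)}{\partial\omega}\right| \le \varepsilon.$$

Hence, $\left|\int_{a}^{b} h(x,y,\omega)\,\kappa(\omega)\,d\omega - \int_{a}^{b} \tilde{h}(x,y,\omega)\,\kappa(\omega)\,d\omega\right| \le \varepsilon \int_{a}^{b} \kappa(\omega)\,d\omega \le \varepsilon$.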

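Given a real number $z$, define vectors $v^z$ and $u^z$ indexed by $i \in \{-i^*, \ldots, i^*\}$, where $i^* = \lceil 16\varepsilon^{-1}\ln(4/\varepsilon) \rceil$, by:

$$v^z_i = \sqrt{z}\cos(\omega_i \ln z)\sqrt{\int_{\omega_i}^{\omega_{i+1}} \kappa(\omega)\,d\omega}, \qquad u^z_i = \sqrt{z}\sin(\omega_i \ln z)\sqrt{\int_{\omega_i}^{\omega_{i+1}} \kappa(\omega)\,d\omega},$$

and note that

$$(v^x_i - v^y_i)^2 + (u^x_i - u^y_i)^2 = h(x,y,\omega_i)\int_{\omega_i}^{\omega_{i+1}} \kappa(\omega)\,d\omega.$$

Therefore,

$$
\begin{aligned}
\|v^x - v^y\|_2^2 + \|u^x - u^y\|_2^2
&= \int_{\omega_{-i^*}}^{\omega_{i^*+1}} \tilde{h}(x,y,\omega)\,\kappa(\omega)\,d\omega \\
&= \int_{\omega_{-i^*}}^{\omega_{i^*+1}} h(x,y,\omega)\,\kappa(\omega)\,d\omega \pm \varepsilon \\
&= \int_{-\infty}^{\infty} h(x,y,\omega)\,\kappa(\omega)\,d\omega \pm 2\varepsilon = f_J(x,y) \pm 2\varepsilon,
\end{aligned}
$$

where the second to last line follows from Lemma 11 and the last line follows from Lemma 10, since $\min(|\omega_{-i^*}|, \omega_{i^*+1}) \ge \ln(4/\varepsilon)$.

Define the vector $a^p$ to be the vector generated by concatenating $v^{p_i}$ and $u^{p_i}$ for $i \in [d]$. Then it follows that

$$\|a^p - a^q\|_2^2 = \mathrm{JS}(p,q) \pm 2\varepsilon d.$$

Hence we have reduced the problem of estimating $\mathrm{JS}(p,q)$ to $\ell_2$ estimation. Rescaling $\varepsilon \leftarrow \varepsilon/(2d)$ ensures the additive error is $\varepsilon$ while the length of the vectors $a^p$ and $a^q$ is $O\left(\frac{d^2}{\varepsilon}\log\frac{d}{\varepsilon}\right)$.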
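The truncate-and-quantize construction above can be sketched in a few lines. This is an illustrative reading rather than a transcription of Algorithm 1: it assumes grid points $\omega_i = \varepsilon i/16$ with $|i| \le \lceil 16\varepsilon^{-1}\ln(4/\varepsilon)\rceil$, approximates each cell mass $\int_{\omega_i}^{\omega_{i+1}}\kappa$ with Simpson's rule, maps zero coordinates to zero vectors, and takes base-2 logarithms in JS.

```python
import math


def kappa(w):
    """Kernel density kappa(w) = 2 sech(pi w) / ((ln 4)(1 + 4 w^2))."""
    return 2.0 / (math.cosh(math.pi * w) * math.log(4.0) * (1.0 + 4.0 * w * w))


def cell_mass(a, b):
    """Simpson's-rule approximation of the integral of kappa over [a, b]."""
    return (b - a) / 6.0 * (kappa(a) + 4.0 * kappa(0.5 * (a + b)) + kappa(b))


def embed_scalar(z, eps):
    """Interleaved (v^z_i, u^z_i) entries for one coordinate value z in [0, 1]."""
    step = eps / 16.0
    i_star = math.ceil(16.0 / eps * math.log(4.0 / eps))
    out = []
    for i in range(-i_star, i_star + 1):
        w = i * step
        weight = math.sqrt(cell_mass(w, w + step))
        if z > 0.0:
            out.append(math.sqrt(z) * math.cos(w * math.log(z)) * weight)
            out.append(math.sqrt(z) * math.sin(w * math.log(z)) * weight)
        else:
            out.extend((0.0, 0.0))  # sqrt(0) = 0, so the entries vanish
    return out


def embed(p, eps):
    """Concatenate the per-coordinate blocks into the vector a^p."""
    vec = []
    for z in p:
        vec.extend(embed_scalar(z, eps))
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((s - t) ** 2 for s, t in zip(a, b))


def js(p, q):
    """Exact JS divergence (base-2 logs assumed), for comparison."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            total += a * math.log2(2 * a / (a + b))
        if b > 0:
            total += b * math.log2(2 * b / (a + b))
    return total
```

With `p = [0.5, 0.3, 0.2]`, `q = [0.2, 0.2, 0.6]`, and `eps = 0.1`, `sq_dist(embed(p, eps), embed(q, eps))` should agree with `js(p, q)` to within the additive error $2\varepsilon d$ derived above. Each coordinate is embedded independently of the others, which is what makes the map usable when coordinates arrive one at a time.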

###### Theorem 12

Algorithm 1 embeds a set $P$ of points under JS into $O\left(\frac{d^2}{\varepsilon}\log\frac{d}{\varepsilon}\right)$ dimensions under $\ell_2^2$ with $\varepsilon$ additive error, independent of the size of $|P|$.

Note that using the JL-Lemma, the dimensionality of the target space can be reduced to $O(\log|P| / \varepsilon^2)$. Theorem 12, along with the AMS sketch of Alon et al. (1996) and the standard assumption of polynomial precision, immediately implies:

###### Corollary 13

There is an algorithm that works in the aggregate streaming model to approximate JS to within a $(1+\varepsilon)$-multiplicative factor using space.

As noted earlier, this is the first algorithm in the aggregate streaming model to obtain a $(1+\varepsilon)$-multiplicative approximation to JS, which contrasts with the linear space lower bounds for the same problem in the update streaming model.

### 4.2 Randomized embedding

In this section we show how to embed $n$ points of JS into $\ell_2^{\bar d}$ with $(1+\varepsilon)$ distortion where $\bar d = O(n^2 d^3 \varepsilon^{-2})$. (If we ignore precision constraints on sampling from a continuous distribution in a streaming algorithm, then this also would yield a sketching bound of for a multiplicative approximation.)

For fixed $x, y \in [0,1]$, we first consider the random variable $T$ where $T$ takes the value $h(x,y,\omega)$ with probability $\kappa(\omega)$. (Recall that $\kappa(\cdot)$ is a distribution.) We compute the first and second moments of $T$.

###### Theorem 14

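$E[T] = f_J(x,y)$ and $\mathrm{var}[T] \le 36\,f_J(x,y)^2$.

Proof  The expectation follows immediately from the definition:

$$E[T] = \int_{-\infty}^{\infty} h(x,y,\omega)\,\kappa(\omega)\,d\omega = f_J(x,y).$$

To bound the variance it will be useful to define the function $f_H(x,y) = (\sqrt{x} - \sqrt{y})^2$, corresponding to the one-dimensional Hellinger distance, which is related to $f_J(x,y)$ as follows. We now state two claims regarding $f_H(x,y)$ and $f_J(x,y)$: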

###### Claim 4.1

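For all $x, y \in [0,1]$, $f_H(x,y) \le 2 f_J(x,y)$.

Proof  Let $f_\chi(x,y) = \frac{(x-y)^2}{x+y}$ correspond to the one-dimensional $\chi^2$ distance. Then, we have

$$\frac{f_\chi(x,y)}{f_H(x,y)} = \frac{(x-y)^2}{(x+y)(\sqrt{x} - \sqrt{y})^2} = \frac{(\sqrt{x} + \sqrt{y})^2}{x+y} = \frac{x + y + 2\sqrt{xy}}{x+y} \ge 1.$$

This shows that $f_H(x,y) \le f_\chi(x,y)$. To show $f_\chi(x,y) \le 2 f_J(x,y)$ we refer the reader to (Topsøe, 2000, Section 3). Combining these two relationships gives us our claim.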

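We then bound $h(x,y,\omega)$ in terms of $f_H(x,y)$ as follows.

###### Claim 4.2

For all $x, y \in [0,1]$ and $\omega \in \mathbb{R}$, $h(x,y,\omega) \le f_H(x,y)\,(1 + 2|\omega|)^2$.

Proof  Without loss of generality, assume $x \ge y$.

$$
\begin{aligned}
\sqrt{h(x,y,\omega)} &= \left|\sqrt{x}\cdot e^{i\omega\ln x} - \sqrt{y}\cdot e^{i\omega\ln y}\right| \\
&\le \left|\sqrt{x}\cdot e^{i\omega\ln x} - \sqrt{y}\cdot e^{i\omega\ln x}\right| + \left|\sqrt{y}\cdot e^{i\omega\ln x} - \sqrt{y}\cdot e^{i\omega\ln y}\right| \\
&= \left|\sqrt{x} - \sqrt{y}\right| + \sqrt{y}\cdot\left|e^{i\omega\ln x} - e^{i\omega\ln y}\right| \\
&= \left|\sqrt{x} - \sqrt{y}\right| + \sqrt{y}\cdot 2\cdot\left|\sin\left(\omega\ln(x/y)/2\right)\right| \\
&\le \sqrt{f_H(x,y)} + \sqrt{y}\cdot 2\cdot\left|\omega\ln\sqrt{x/y}\right| \\
&\le \sqrt{f_H(x,y)} + \sqrt{y}\cdot 2\cdot\left|\sqrt{x/y} - 1\right|\cdot|\omega| \\
&= \sqrt{f_H(x,y)} + 2\sqrt{f_H(x,y)}\cdot|\omega|,
\end{aligned}
$$

and hence $h(x,y,\omega) \le f_H(x,y)\,(1 + 2|\omega|)^2$ as required.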
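The two claims can be spot-checked numerically on a small grid, along with the factor $\int (1+2|\omega|)^4\kappa(\omega)\,d\omega$ that appears when the bound of Claim 4.2 is squared and integrated against $\kappa$. The grid values, tolerances, trapezoid integration, and base-2 logarithm in $f_J$ below are illustrative assumptions; this is a sanity check, not a proof.

```python
import math


def kappa(w):
    """Kernel density kappa(w) = 2 sech(pi w) / ((ln 4)(1 + 4 w^2))."""
    return 2.0 / (math.cosh(math.pi * w) * math.log(4.0) * (1.0 + 4.0 * w * w))


def f_h(x, y):
    """One-dimensional Hellinger term."""
    return (math.sqrt(x) - math.sqrt(y)) ** 2


def f_chi(x, y):
    """One-dimensional chi-squared term."""
    return (x - y) ** 2 / (x + y)


def f_j(x, y):
    """One-dimensional JS term (base-2 logs assumed); needs x, y > 0."""
    return x * math.log2(2 * x / (x + y)) + y * math.log2(2 * y / (x + y))


def h(x, y, w):
    """Squared modulus of sqrt(x) e^{i w ln x} - sqrt(y) e^{i w ln y}; needs x, y > 0."""
    re = math.sqrt(x) * math.cos(w * math.log(x)) - math.sqrt(y) * math.cos(w * math.log(y))
    im = math.sqrt(x) * math.sin(w * math.log(x)) - math.sqrt(y) * math.sin(w * math.log(y))
    return re * re + im * im


def claims_hold(values=(0.01, 0.1, 0.3, 0.7, 1.0),
                omegas=(-5.0, -0.5, 0.0, 0.3, 2.0, 10.0), tol=1e-12):
    """Check f_H <= f_chi <= 2 f_J and h <= f_H (1 + 2|w|)^2 on a grid."""
    for x in values:
        for y in values:
            if f_h(x, y) > f_chi(x, y) + tol:
                return False
            if f_chi(x, y) > 2 * f_j(x, y) + tol:
                return False
            for w in omegas:
                if h(x, y, w) > f_h(x, y) * (1 + 2 * abs(w)) ** 2 + tol:
                    return False
    return True


def moment_factor(lo=-20.0, hi=20.0, n=40000):
    """Trapezoid estimate of the integral of (1 + 2|w|)^4 kappa(w)."""
    g = lambda w: (1 + 2 * abs(w)) ** 4 * kappa(w)
    step = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for k in range(1, n):
        total += g(lo + k * step)
    return total * step
```

Calling `claims_hold()` returns `True` when no grid point violates either claim, and `moment_factor()` returns the numerical value of the integral.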

These claims allow us to bound the variance:

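$$\mathrm{var}[T] \le E[T^2] = \int_{-\infty}^{\infty} \left(h(x,y,\omega)\right)^2 \kappa(\omega)\,d\omega \le f_H(x,y)^2 \int_{-\infty}^{\infty} (1 + 2|\omega|)^4\,\kappa(\omega)\,d\omega = f_H(x,y)^2 \cdot 8.94 < 36\,f_J(x,y)^2.$$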

This naturally gives rise to the following algorithm.

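Let $\omega_1, \ldots, \omega_t$ be $t$ independent samples chosen according to $\kappa(\omega)$. For any distribution $p$ on $[d]$, define vectors $v^{p_i}, u^{p_i} \in \mathbb{R}^t$ where, for $j \in [t]$,

$$v^{p_i}_j = \sqrt{p_i}\cdot\cos(\omega_j \ln p_i)/\sqrt{t}, \qquad u^{p_i}_j = \sqrt{p_i}\cdot\sin(\omega_j \ln p_i)/\sqrt{t}.$$

Let $\mathbf{v}^{p_i}$ be a concatenation of $v^{p_i}_j$ and $u^{p_i}_j$ over all $j \in [t]$. Then note that $E\left[\|\mathbf{v}^{p_i} - \mathbf{v}^{q_i}\|_2^2\right] = f_J(p_i, q_i)$ and $\mathrm{var}\left[\|\mathbf{v}^{p_i} - \mathbf{v}^{q_i}\|_2^2\right] \le 36\,f_J(p_i, q_i)^2/t$. Hence, for $t = 36 n^2 d^2 \varepsilon^{-2}$, by an application of the Chebyshev bound,

$$\Pr\left[\left|\|\mathbf{v}^{p_i} - \mathbf{v}^{q_i}\|_2^2 - f_J(p_i, q_i)\right| \ge \varepsilon f_J(p_i, q_i)\right] \le 36\varepsilon^{-2}/t = (nd)^{-2}. \tag{4.1}$$

By an application of the union bound over all pairs of points:

$$\Pr\left[\exists\, i \in [d],\ p, q \in P : \left|\|\mathbf{v}^{p_i} - \mathbf{v}^{q_i}\|_2^2 - f_J(p_i, q_i)\right| \ge \varepsilon f_J(p_i, q_i)\right] \le 1/d.$$

Hence, if $\mathbf{v}^{p}$ is a concatenation of $\mathbf{v}^{p_i}$ over all $i \in [d]$, then with probability at least $1 - 1/d$ it holds for all $p, q \in P$:

$$(1 - \varepsilon)\,\mathrm{JS}(p,q) \le \|\mathbf{v}^{p} - \mathbf{v}^{q}\|_2^2 \le (1 + \varepsilon)\,\mathrm{JS}(p,q).$$

The final length of the vectors is then $2td = O(n^2 d^3 \varepsilon^{-2})$ for approximately preserving distances between every pair of points with probability at least $1 - 1/d$. This can be reduced further to $O(\log n / \varepsilon^2)$ by simply applying the JL-Lemma.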
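A minimal sketch of the sampling-based embedding follows. The text does not say how to draw from $\kappa$; the rejection sampler here is one possible choice and an assumption of this example: it proposes from the density $\operatorname{sech}(\pi\omega)$ by inverting its CDF $\frac{2}{\pi}\arctan(e^{\pi\omega})$ and accepts with probability $1/(1+4\omega^2)$. Entries are scaled by $1/\sqrt{t}$, the normalization under which the squared distance is the sample mean of $h$, and JS uses base-2 logarithms.

```python
import math
import random


def sample_kappa(rng):
    """Draw one sample from kappa by rejection from the sech(pi w) density."""
    while True:
        u = rng.random()
        if u <= 0.0:
            continue  # avoid log(tan(0)) = log(0)
        w = math.log(math.tan(0.5 * math.pi * u)) / math.pi  # inverse CDF
        if rng.random() < 1.0 / (1.0 + 4.0 * w * w):
            return w


def embed(p, omegas):
    """Concatenate sqrt(p_i) cos/sin(w_j ln p_i) / sqrt(t) over all i and j."""
    scale = 1.0 / math.sqrt(len(omegas))
    vec = []
    for z in p:
        if z > 0.0:
            root, log_z = math.sqrt(z) * scale, math.log(z)
            for w in omegas:
                vec.append(root * math.cos(w * log_z))
                vec.append(root * math.sin(w * log_z))
        else:
            vec.extend([0.0] * (2 * len(omegas)))  # sqrt(0) = 0
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((s - t) ** 2 for s, t in zip(a, b))


def js(p, q):
    """Exact JS divergence (base-2 logs assumed), for comparison."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            total += a * math.log2(2 * a / (a + b))
        if b > 0:
            total += b * math.log2(2 * b / (a + b))
    return total


rng = random.Random(0)
omegas = [sample_kappa(rng) for _ in range(5000)]  # shared by all points
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
estimate = sq_dist(embed(p, omegas), embed(q, omegas))
exact = js(p, q)
```

The same samples `omegas` must be reused for every point so that distances between embedded vectors are comparable. The number of samples used here is far smaller than the worst-case $t$ in the analysis and is chosen only to keep the example fast.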

## 5 Embedding $\chi^2$ into $\ell_2^2$

We give here two algorithms for embedding the $\chi^2$ divergence into $\ell_2^2$. The computation and the resulting two algorithms are highly analogous to Section 4. First, the explicit formulation given by Vedaldi and Zisserman (2012) yields that for $x, y \in [0,1]$:

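$$
\begin{aligned}
\chi^2(x,y) &= \int_{-\infty}^{+\infty} \left\| e^{i\omega\ln x}\sqrt{x\,\operatorname{sech}(\pi\omega)} - e^{i\omega\ln y}\sqrt{y\,\operatorname{sech}(\pi\omega)} \right\|^2 d\omega \\
&= \int_{-\infty}^{+\infty} \operatorname{sech}(\pi\omega) \left\| \sqrt{x}\,e^{i\omega\ln x} - \sqrt{y}\,e^{i\omega\ln y} \right\|^2 d\omega.
\end{aligned}
$$

For convenience, we now define:

$$h(x,y,\omega) = \left\| \sqrt{x}\,e^{i\omega\ln x} - \sqrt{y}\,e^{i\omega\ln y} \right\|^2$$

and

$$\kappa_\chi(\omega) = \operatorname{sech}(\pi\omega).$$

We can then write $\chi^2(p,q) = \sum_{i=1}^{d} f_\chi(p_i, q_i)$ where

$$f_\chi(x,y) = \int_{-\infty}^{\infty} h(x,y,\omega)\,\kappa_\chi(\omega)\,d\omega = \frac{(x-y)^2}{x+y}.$$

It is easy to verify that $\kappa_\chi(\omega)$ is a distribution, i.e., $\int_{-\infty}^{\infty} \kappa_\chi(\omega)\,d\omega = 1$.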
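For readers who want to see why the closed form holds, one short route is to expand $h$ and use the Fourier pair $\int_{-\infty}^{\infty}\operatorname{sech}(\pi\omega)\cos(\omega\lambda)\,d\omega = \operatorname{sech}(\lambda/2)$. This identity is a standard fact about the hyperbolic secant density and is brought in here as an assumption; it is not stated in the text. At $\lambda = 0$ it also gives the normalization $\int \operatorname{sech}(\pi\omega)\,d\omega = 1$.

```latex
\begin{align*}
h(x,y,\omega) &= x + y - 2\sqrt{xy}\,\cos\!\big(\omega \ln(x/y)\big), \\
\int_{-\infty}^{\infty} h(x,y,\omega)\,\operatorname{sech}(\pi\omega)\,d\omega
  &= x + y - 2\sqrt{xy}\,\operatorname{sech}\!\big(\tfrac{1}{2}\ln(x/y)\big) \\
  &= x + y - 2\sqrt{xy}\cdot\frac{2\sqrt{xy}}{x+y} \\
  &= \frac{(x+y)^2 - 4xy}{x+y} = \frac{(x-y)^2}{x+y}.
\end{align*}
```

The third line uses $\operatorname{sech}\!\big(\tfrac12\ln(x/y)\big) = 2/\big(\sqrt{x/y} + \sqrt{y/x}\big) = 2\sqrt{xy}/(x+y)$.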

### 5.1 Deterministic embedding

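We will produce an embedding $\phi_p = (\phi_{p_1}, \ldots, \phi_{p_d})$, where each $\phi_{p_i}$ is an integral that we discretize appropriately.

###### Lemma 15

For $x, y \in [0,1]$, we have $0 \le h(x,y,\omega) \le 2$ and $\left|\frac{\partial h(x,y,\omega)}{\partial \omega}\right| \le 16$.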

Similar to Section 4, the next two steps analyze truncating and quantizing the integral.

###### Lemma 16 (Truncation)

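For $t \ge \ln(3/\varepsilon)$,

$$f_\chi(x,y) \ge \int_{-t}^{t} h(x,y,\omega)\,\kappa_\chi(\omega)\,d\omega \ge f_\chi(x,y) - \varepsilon.$$

Proof  The first inequality follows since $h(x,y,\omega) \ge 0$. For the second inequality, we use $h(x,y,\omega) \le 2$:

$$\int_{-\infty}^{-t} h(x,y,\omega)\,\kappa_\chi(\omega)\,d\omega + \int_{t}^{\infty} h(x,y,\omega)\,\kappa_\chi(\omega)\,d\omega \le 4\int_{t}^{\infty} \kappa_\chi(\omega)\,d\omega < 4\int_{t}^{\infty} 2e^{-\pi\omega}\,d\omega < 3e^{-t} \le \varepsilon,$$

where the last inequality follows if $t \ge \ln(3/\varepsilon)$.

Define $\omega_i = \varepsilon i/16$ for $i \in \mathbb{Z}$ and $\tilde{h}(x,y,\omega) = h(x,y,\omega_i)$ for $\omega \in [\omega_i, \omega_{i+1})$. We recall the following lemma from Section 4: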

###### Lemma 17 (Quantization)

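For any $a, b$,

$$\int_{a}^{b} h(x,y,\omega)\,\kappa_\chi(\omega)\,d\omega = \int_{a}^{b} \tilde{h}(x,y,\omega)\,\kappa_\chi(\omega)\,d\omega \pm \varepsilon.$$

Given a real number $z$, define vectors $v^z$ and $u^z$ indexed by $i \in \{-i^*, \ldots, i^*\}$, where $i^* = \lceil 16\varepsilon^{-1}\ln(3/\varepsilon) \rceil$, by:

$$v^z_i = \sqrt{z}\cos(\omega_i \ln z)\sqrt{\int_{\omega_i}^{\omega_{i+1}} \kappa_\chi(\omega)\,d\omega}, \qquad u^z_i = \sqrt{z}\sin(\omega_i \ln z)\sqrt{\int_{\omega_i}^{\omega_{i+1}} \kappa_\chi(\omega)\,d\omega},$$

and note that

$$(v^x_i - v^y_i)^2 + (u^x_i - u^y_i)^2 = h(x,y,\omega_i)\int_{\omega_i}^{\omega_{i+1}} \kappa_\chi(\omega)\,d\omega.$$

Therefore,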

 ∥vx−vy∥22+∥ux−uy∥22 = ∫wi∗+1w−i∗~h(x,y,ω)κχ(ω)dω=∫wi∗+1w−i∗h(x,y,ω)κχ(ω)dω±ε = ∫∞−∞h(x,y,ω)κχ(ω)d