# Sketching, Embedding, and Dimensionality Reduction for Information Spaces

## Abstract

*Information distances* like the Hellinger distance and the Jensen–Shannon
divergence have deep roots in information theory and machine learning. They are
used extensively in data analysis, especially when the objects being compared are
high-dimensional empirical probability distributions built from data. However, we
lack the common tools needed to use information distances in applications efficiently
and at scale with provable guarantees. We can’t sketch these
distances easily, or embed them in better-behaved spaces, or even reduce the
dimensionality of the space while maintaining the probability structure of the data.

In this paper, we build these tools for information distances: for the Hellinger distance and the Jensen–Shannon divergence, as well as related measures such as the $\chi^2$ divergence. We first show that they can be sketched efficiently (i.e., up to multiplicative error in sublinear space) in the *aggregate* streaming model. This result is exponentially stronger than known upper bounds for sketching these distances in the *strict turnstile* streaming model. Second, we show a finite-dimensional embedding result for the Jensen–Shannon and $\chi^2$ divergences that preserves pairwise distances. Finally, we prove
a dimensionality reduction result for the Hellinger, Jensen–Shannon, and
$\chi^2$ divergences that preserves the information geometry of the
distributions (specifically, by retaining the simplex structure of the
space). While our second result already implies that these divergences can
be explicitly embedded in Euclidean space, retaining the simplex structure is
important because it allows us to continue doing inference in the reduced
space. In essence, we preserve not just the distance structure but the
underlying geometry of the space.


University of Utah

Ravi Kumar (tintin@google.com), Google, Inc.

Andrew McGregor (mcgregor@cs.umass.edu), University of Massachusetts–Amherst

Sergei Vassilvitskii (sergeiv@google.com), Google, Inc.

Suresh Venkatasubramanian (suresh@cs.utah.edu), University of Utah

## 1 Introduction

The space of *information distances* includes many distances that are used
extensively in data analysis. These include the well-known Bregman divergences, the $\alpha$-divergences, and the $f$-divergences. In this work we focus on a subclass of the $f$-divergences that admit embeddings into some (possibly infinite-dimensional) Hilbert space, with a specific emphasis on the JS divergence. These divergences are used in statistical tests and estimators (Beran, 1977), as well as in image analysis (Peter and Rangarajan, 2008),
computer vision (Huang et al., 2005; Mahmoudi and Sapiro, 2009), and text
analysis (Dhillon et al., 2003; Eiron and McCurley, 2003). They were introduced by Csiszár (1967) and, in the most general case, also include
measures such as the Hellinger, JS, and $\chi^2$ divergences (here we consider a symmetrized variant of the $\chi^2$ distance).

To work with the geometry of these divergences effectively at scale and in high dimensions, we need algorithmic tools that provide provably high-quality approximate representations of that geometry.
The techniques of *sketching*, *embedding*, and *dimensionality
reduction* have evolved as ways of dealing with this problem.

A sketch for a set of points with respect to a property $P$ is a function that maps the data to a small summary from which $P$ can be evaluated, albeit with some approximation error. Linear sketches are especially useful for estimating a derived property of a data stream in a fast and compact way.^{2}

Unfortunately, while these tools have been well developed for norms like $\ell_1$ and $\ell_2$, we lack such tools for information distances. This is not just a theoretical concern: information distances are semantically better suited to many tasks in machine learning, and building the appropriate algorithmic toolkit to manipulate them efficiently would greatly expand the places where they can be used.

### 1.1 Our contributions

#### Sketching information divergences.

Guha, Indyk, and McGregor (2007) proved an impossibility result, showing that a large class of information divergences cannot be sketched in sublinear space, even if we allow for constant factor approximations. This result holds in the *strict turnstile streaming model*, in which coordinates of two points $x$, $y$ are increased incrementally and we wish to maintain an estimate of the divergence between them. They left open the question of whether these divergences can be sketched in the *aggregate* streaming model, where each element of the stream gives the $i$th coordinate of $x$ or $y$ in its entirety, but the coordinates may appear in an arbitrary order. We answer this in the affirmative for two important information distances, namely, the Jensen–Shannon and $\chi^2$ divergences.

A set of points under the Jensen–Shannon (JS) or $\chi^2$ divergence can be deterministically embedded into dimensions under $\ell_2^2$ with additive error. The same space bound holds when sketching JS or $\chi^2$ in the *aggregate* stream model.

*Corollary.*
Assuming polynomial precision, an AMS sketch for Euclidean distance can reduce the dimension to for a multiplicative approximation in the aggregate stream setting.

A set of points under the JS or $\chi^2$ divergence can be embedded into with with multiplicative error. For both techniques, applying the Euclidean JL–Lemma can further reduce the dimension to in the offline setting.

#### Dimensionality reduction.

We then turn to the more challenging case of performing dimensionality reduction for information distances, where
we wish to preserve not only the distances between pairs of points (distributions),
but also the underlying simplicial structure of the space, so that we can
continue to interpret coordinates in the new space as probabilities. This notion
of a *structure-preserving* dimensionality reduction is implicit when
dealing with normed spaces (since we always map a normed space to another), but
requires an explicit mapping when dealing with more structured
spaces. We prove an analog of the classical JL–Lemma:

For the Jensen–Shannon, Hellinger, and $\chi^2$ divergences, there exists a structure-preserving dimensionality reduction from the high-dimensional simplex to a low-dimensional simplex , where .

The theorem extends to “well-behaved” $f$-divergences (see Section 3 for a precise definition). Moreover, the dimensionality reduction is constructive for any divergence with a finite-dimensional kernel (such as the Hellinger divergence), or an infinite-dimensional kernel that can be sketched in finite space, as we show is feasible for the JS and $\chi^2$ divergences.

#### Our techniques.

The unifying approach of our three results (sketching, embedding into $\ell_2^2$, and dimensionality reduction) is to analyze carefully the infinite-dimensional kernel of the information divergences. Quantizing and truncating the kernel yields the sketching result; sampling repeatedly from it produces an embedding into $\ell_2^2$. Finally, given such an embedding, we show how to perform dimensionality reduction by proving that each of the divergences admits a region of the simplex where it is similar to $\ell_2^2$. To the best of our knowledge, this is the first result that explicitly uses the kernel representation of these information distances to build approximate geometric structures; while the *existence* of a kernel for the Jensen–Shannon distance was well known, this structure had never been exploited for algorithmic advantage.

## 2 Related Work

The works of Fuglede and Topsøe (2004), and then of Vedaldi and Zisserman (2012), study embeddings of information divergences into an infinite-dimensional Hilbert space by representing them as an integral along a one-dimensional curve in . Vedaldi and Zisserman give an explicit formulation of this kernel for the JS and $\chi^2$ divergences, for which a discretization (by quantizing and truncating) yields an additive error embedding into a finite-dimensional $\ell_2^2$. However, they do not obtain quantitative bounds on the dimension of the target space needed, nor do they address the question of multiplicative approximation guarantees.

In the realm of sketches, Guha, Indyk, and McGregor (2007) show that $\Omega(n)$ space (where $n$ is the length of the stream) is required in the strict turnstile model even for a constant factor multiplicative approximation. These bounds hold for a wide range of information divergences, including the JS, Hellinger, and $\chi^2$ divergences. They show, however, that an *additive* error of can be achieved using space.
In contrast, one can indeed achieve a multiplicative approximation in the aggregate streaming model for information divergences that have a finite-dimensional embedding into $\ell_2^2$. For instance, Guha et al. (2006) observe that for the Hellinger distance, which has a trivial such embedding, sketching is equivalent to sketching $\ell_2$ and hence may be done up to a $(1+\epsilon)$-multiplicative approximation in space. This immediately implies a constant factor approximation of the JS and $\chi^2$ divergences in the same space, but no bounds were known prior to our work for a $(1+\epsilon)$-sketching result for the JS and $\chi^2$ divergences in *any* streaming model.

Moving on to dimensionality reduction from simplex to simplex, in the only other work we are aware of, Kyng, Phillips, and Venkatasubramanian (2010) show a limited dimensionality reduction result for the Hellinger distance. Their approach works by showing that if the input points lie in a specific region of the simplex, then a standard random projection will keep the points on a lower-dimensional simplex while preserving the distances approximately. Unfortunately, this region is a small ball centered in the interior of the simplex, which further shrinks with the dimension. This is in sharp contrast to our work here, where the input points are unconstrained.

While it does not admit a kernel, the $\ell_1$ distance is also an $f$-divergence, and it is therefore natural to investigate its potential connection with the measures we study here. For $\ell_1$, it is well known that significant dimensionality reduction is not possible: an embedding with distortion requires the points to be embedded in dimensions, which is nearly linear. This result was proved (and strengthened) in a series of results (Andoni et al., 2011; Regev, 2012; Lee and Naor, 2004; Brinkman and Charikar, 2005).

The general literature of sketching and embeddability in normed spaces is too extensive to be reviewed here: we point the reader to Andoni et al. (2014) for a full discussion of results in this area. One of the most famous applications of dimension reduction is the Johnson–Lindenstrauss (JL) Lemma, which states that any set of $n$ points in $\ell_2$ can be embedded into $O(\log n / \epsilon^2)$ dimensions in the same space while preserving pairwise distances to within a factor of $(1 \pm \epsilon)$. This result has become a core step in algorithms for near neighbor search (Ailon and Chazelle, 2006; Andoni and Indyk, 2006), speeding up clustering algorithms (Boutsidis et al., 2015), and efficient approximation of matrices (Clarkson and Woodruff, 2013), among many others.
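As an illustration of the JL Lemma mentioned above, the following minimal sketch projects points with a dense Gaussian matrix and compares squared distances before and after. The function names, the dimensions, and the choice of a Gaussian projection are assumptions made for this example; many other JL constructions exist.

```python
import numpy as np

def jl_project(points, k, seed=0):
    """Project the rows of `points` (n x d) to k dimensions with a scaled Gaussian matrix."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    # Entries are N(0, 1/k), so squared norms are preserved in expectation.
    proj = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
    return points @ proj

def pairwise_sq_dists(points):
    """All pairwise squared Euclidean distances, as an n x n matrix."""
    sq = np.sum(points ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * points @ points.T

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2000))      # 20 points in 2000 dimensions (illustrative sizes)
Y = jl_project(X, k=1000)            # the same points in 1000 dimensions
iu = np.triu_indices(20, k=1)        # indices of the distinct pairs
ratios = pairwise_sq_dists(Y)[iu] / pairwise_sq_dists(X)[iu]
print(ratios.min(), ratios.max())    # both should be close to 1
```

The target dimension here is chosen generously for readability rather than from the bound in the lemma.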

Although sketching, embeddability, and dimensionality reduction are related operations, they are not always equivalent. For example, even though $\ell_1$ and $\ell_2$ have very different behavior under dimensionality reduction, they can both be sketched to an arbitrary error in the turnstile model (and in fact any $\ell_p$ norm with $p \in (0, 2]$ can be sketched using $p$-stable distributions (Indyk, 2000)). In the offline setting, Andoni et al. (2014) show that sketching and embedding of normed spaces are equivalent: for any finite-dimensional normed space $X$, a constant distortion and space sketching algorithm for $X$ exists if and only if there exists a linear embedding of $X$ into $\ell_{1-\epsilon}$.

## 3 Background

In this section, we define precisely the class of information divergences that we work with, and the specific properties that allow us to obtain sketching, embedding, and dimensionality reduction results.
For what follows denotes the *-simplex*: and . Let
.
*Definition ($f$-divergence).*
Let and be two distributions on . A convex function such that gives rise to an
*$f$-divergence*
as:

where we define , , and .
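For reference, the standard Csiszár form of an $f$-divergence, together with its usual boundary conventions, can be written as follows. The symbols $p$, $q$, $d$, $a$, and $u$, and the orientation $p_i f(q_i / p_i)$, are assumptions made for this restatement and may differ from the notation used elsewhere in this paper.

```latex
\[
  D_f(p, q) \;=\; \sum_{i=1}^{d} p_i \, f\!\left(\frac{q_i}{p_i}\right),
  \qquad f \text{ convex}, \quad f(1) = 0,
\]
\[
  0 \cdot f\!\left(\tfrac{0}{0}\right) = 0, \qquad
  a \cdot f\!\left(\tfrac{0}{a}\right) = a \lim_{u \to 0} f(u), \qquad
  0 \cdot f\!\left(\tfrac{a}{0}\right) = a \lim_{u \to \infty} \frac{f(u)}{u}.
\]
```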

*Definition (Regular distance).*
We call a distance function $D$ *regular* if there
exists a feature map $\phi$ into a (possibly infinite-dimensional) Hilbert
space such that $D(x, y) = \|\phi(x) - \phi(y)\|^2$ for all $x$, $y$.

The work of Fuglede and Topsøe (2004) establishes that JS is regular; Vedaldi and Zisserman (2012) construct an explicit feature map for the JS kernel, as , where is given by

Hence we have for , , .
The “embedding” for a given distribution is then
the concatenation of the functions , i.e.,
.
*Definition (Well-behaved divergence).*
A *well-behaved* $f$-divergence is a regular $f$-divergence such that , , ,
and exists.

In this paper, we will focus on the following well-behaved $f$-divergences.

*Definition.*
The *Jensen–Shannon* (JS), *Hellinger*, and
$\chi^2$ divergences between distributions $p$ and $q$ are defined as:
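As a concrete reference point, the following sketch computes the three divergences under one common normalization: natural logarithms and no factor of 1/2 for JS, the squared form for Hellinger, and the symmetrized form for $\chi^2$. These normalizations and the function names are assumptions made for illustration; constant factors may differ from the definitions intended here.

```python
import numpy as np

def hellinger(p, q):
    """Squared Hellinger distance: sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def chi2(p, q):
    """Symmetrized chi-squared divergence: sum_i (p_i - q_i)^2 / (p_i + q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    s = p + q
    mask = s > 0                      # coordinates with p_i = q_i = 0 contribute nothing
    return float(np.sum((p[mask] - q[mask]) ** 2 / s[mask]))

def _xlog_ratio(x, m):
    """x * ln(x / m), with the convention 0 * ln(0) = 0."""
    out = np.zeros_like(x)
    pos = x > 0
    out[pos] = x[pos] * np.log(x[pos] / m[pos])
    return out

def js(p, q):
    """JS divergence: sum_i p_i ln(2 p_i / (p_i + q_i)) + q_i ln(2 q_i / (p_i + q_i))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2.0
    return float(np.sum(_xlog_ratio(p, m) + _xlog_ratio(q, m)))

p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.1, 0.4, 0.4, 0.1])
print(js(p, q), hellinger(p, q), chi2(p, q))
```

Under these normalizations the Hellinger value is exactly the squared Euclidean distance between the vectors of coordinate-wise square roots, which is the trivial finite-dimensional embedding referred to in Section 2.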

## 4 Embedding JS into $\ell_2^2$

We present two algorithms for embedding JS into $\ell_2^2$. The first is deterministic and gives an additive error approximation, whereas the second is randomized but yields a multiplicative approximation in an offline setting. The advantage of the first algorithm is that it can be realized in the streaming model and, if we make a standard assumption of polynomial precision in the streaming input, it yields a $(1+\epsilon)$-multiplicative approximation in this setting as well.

We derive some terms in the kernel representation of which we will find convenient. First, the explicit formulation in Section 3 yields that for , :

For convenience, we now define:

and

We can then write where

It is easy to verify that is a distribution, i.e., .
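One way to write the kernel representation in a single formula is given below. It follows the homogeneous kernel map of Vedaldi and Zisserman (2012) and assumes the natural-logarithm normalization of JS without a factor of 1/2; the symbols $x$, $y$, $\omega$, and $\kappa$ are choices made for this restatement, so constants may differ from the ones used in the rest of this section.

```latex
\[
  x \ln\frac{2x}{x+y} + y \ln\frac{2y}{x+y}
  \;=\; \ln 2 \int_{-\infty}^{\infty} \kappa(\omega)\,
        \Bigl( x + y - 2\sqrt{xy}\,\cos\bigl(\omega \ln(x/y)\bigr) \Bigr)\, d\omega,
  \qquad
  \kappa(\omega) = \frac{\operatorname{sech}(\pi\omega)}{\ln 2\,(1 + 4\omega^2)},
\]
\[
  x + y - 2\sqrt{xy}\,\cos\bigl(\omega \ln(x/y)\bigr)
  \;=\; \Bigl|\sqrt{x}\,e^{i\omega \ln x} - \sqrt{y}\,e^{i\omega \ln y}\Bigr|^2,
  \qquad
  \int_{-\infty}^{\infty} \kappa(\omega)\, d\omega = 1 .
\]
```

The second line shows why each frequency $\omega$ contributes a squared distance between two complex features, which is what makes the quantity an $\ell_2^2$ distance once the integral is discretized or sampled.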

### 4.1 Deterministic embedding

We will produce an embedding , where each is an integral that we can discretize by quantizing and truncating carefully.

Input: where coordinates are ordered by arrival.
Output: A vector of length
; ,

for to
for to
for to

return concatenated with .
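The truncate-and-quantize idea can be sketched in a few lines. The code below is a minimal illustration and not a transcription of Algorithm 4.1: the kernel density `kappa_js`, the truncation point `T`, the step `h`, and the natural-logarithm normalization of JS are all assumptions made for this example.

```python
import numpy as np

LN2 = np.log(2.0)

def kappa_js(w):
    """Assumed spectral density of the JS kernel; integrates to 1 over the real line."""
    return 1.0 / (LN2 * np.cosh(np.pi * w) * (1.0 + 4.0 * w ** 2))

def js_divergence(p, q):
    """Reference value: sum_i p_i ln(2 p_i/(p_i+q_i)) + q_i ln(2 q_i/(p_i+q_i))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2.0
    def term(x):
        pos = x > 0
        return np.sum(x[pos] * np.log(x[pos] / m[pos]))
    return float(term(p) + term(q))

def embed_js(p, T=6.0, h=0.05):
    """Truncate frequencies to [-T, T], quantize with step h, and emit one
    cosine and one sine feature per (coordinate, frequency) pair."""
    p = np.asarray(p, float)
    J = int(round(T / h))
    w = h * np.arange(-J, J + 1)                   # quantized frequencies
    weight = np.sqrt(LN2 * kappa_js(w) * h)        # square roots of quadrature weights
    logp = np.log(np.where(p > 0, p, 1.0))         # dummy value where p_i = 0
    amp = np.sqrt(p)[:, None] * weight[None, :]    # vanishes wherever p_i = 0
    phase = logp[:, None] * w[None, :]
    return np.concatenate([(amp * np.cos(phase)).ravel(),
                           (amp * np.sin(phase)).ravel()])

p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.1, 0.4, 0.4, 0.1])
vp, vq = embed_js(p), embed_js(q)
approx = float(np.sum((vp - vq) ** 2))             # squared Euclidean distance
print(approx, js_divergence(p, q))
```

Each coordinate is embedded independently of all others, which is what allows the features to be produced as coordinates arrive.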

To analyze Algorithm 4.1, we first obtain bounds on the function and its derivative.

*Lemma.* For , we have and .

*Proof.* Clearly . Furthermore, since , we have

where the last inequality follows since . The next two steps are useful to approximate the infinite-dimensional continuous representation by a finite-dimensional discrete representation by appropriately truncating and quantizing the integral.

*Lemma (Truncation).* For ,

The first inequality follows since . For the second inequality, we use :

where the last line follows if .

Define for and .

*Lemma (Quantization).* For any ,

First note that

Hence, .

Given a real number , define vectors and indexed by where by:

and note that

Therefore,

where the second to last line follows from Lemma 4.1 and the last line follows from Lemma 4.1, since .

Define the vector to be the vector generated by concatenating and for . Then it follows that

Hence we have reduced the problem of estimating to estimation. Rescaling ensures the additive error is while the length of the vectors and is .

*Theorem.* Algorithm 4.1 embeds a set of points under JS into dimensions under with additive error, independent of the size of .

Note that using the JL-Lemma, the dimensionality of the target space can be reduced to . Theorem 4.1, along with the AMS sketch of Alon et al. (1996) and the standard assumption of polynomial precision, immediately implies:

*Corollary.* There is an algorithm that works in the aggregate streaming model to approximate JS to within -multiplicative factor using space.

As noted earlier, this is the first algorithm in the aggregate streaming model to obtain an -multiplicative approximation to JS, which contrasts with the linear space lower bounds for the same problem in the update streaming model.
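To make the aggregate streaming setting concrete, the sketch below processes stream elements of the form (vector, coordinate index, value) in arbitrary order and maintains a short linear sketch per vector. It is an illustration under several assumptions: it uses a Gaussian linear sketch in place of the AMS sketch, regenerates the random block for each coordinate from a seeded generator instead of using limited-independence hashing, and reuses the assumed kernel density and normalization from the earlier illustration. The class and function names are hypothetical.

```python
import numpy as np

LN2 = np.log(2.0)

def coord_features(x, T=5.0, h=0.1):
    """Quantized kernel features of a single coordinate value x >= 0."""
    J = int(round(T / h))
    w = h * np.arange(-J, J + 1)
    kappa = 1.0 / (LN2 * np.cosh(np.pi * w) * (1.0 + 4.0 * w ** 2))
    if x <= 0:
        return np.zeros(2 * w.size)
    amp = np.sqrt(x * LN2 * kappa * h)
    return np.concatenate([amp * np.cos(w * np.log(x)), amp * np.sin(w * np.log(x))])

class AggregateJSSketch:
    """Keeps k numbers per distribution, updated one full coordinate at a time."""
    def __init__(self, k, seed=0):
        self.k, self.seed = k, seed
        self.sketch = {"p": np.zeros(k), "q": np.zeros(k)}

    def update(self, who, i, value):
        feats = coord_features(value)
        # The random block for coordinate i is regenerated from (seed, i),
        # so p and q see the same block without storing the full matrix.
        rng = np.random.default_rng([self.seed, i])
        block = rng.normal(0.0, 1.0 / np.sqrt(self.k), size=(self.k, feats.size))
        self.sketch[who] += block @ feats

    def estimate(self):
        diff = self.sketch["p"] - self.sketch["q"]
        return float(diff @ diff)

def js_divergence(p, q):
    m = (p + q) / 2.0
    def term(x):
        pos = x > 0
        return np.sum(x[pos] * np.log(x[pos] / m[pos]))
    return float(term(p) + term(q))

p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.1, 0.4, 0.4, 0.1])
stream = [("p", i, p[i]) for i in range(4)] + [("q", i, q[i]) for i in range(4)]
sk = AggregateJSSketch(k=2000)
for idx in np.random.default_rng(7).permutation(len(stream)):
    sk.update(*stream[idx])                 # coordinates arrive in arbitrary order
print(sk.estimate(), js_divergence(p, q))
```

Because the sketch is linear in the per-coordinate features, the order in which coordinates arrive does not affect the final estimate.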

### 4.2 Randomized embedding

In this section we show how to embed points of JS into with distortion where . ^{3}

For fixed , we first consider the random variable where takes the value with probability . (Recall that is a distribution.) We compute the first and second moments of .

*Theorem.* and .

*Proof.* The expectation follows immediately from the definition:

To bound the variance it will be useful to define the function corresponding to the one-dimensional Hellinger distance that is related to as follows. We now state two claims regarding and :

###### Claim 4.1

For all , .

Let correspond to the one-dimensional $\chi^2$ distance. Then, we have

This shows that . To show we refer the reader to (Topsøe, 2000, Section 3). Combining these two relationships gives us our claim.

We then bound in terms of as follows.

###### Claim 4.2

For all , .

Without loss of generality, assume .

and hence as required.

These claims allow us to bound the variance:

This naturally gives rise to the following algorithm.
Input: .
Output: A vector of length
;

for to
a draw from ;
for to
for to

return concatenated with .

Let be independent samples chosen according to . For any distribution on , define vectors where, for ,

Let be a concatenation of and over all . Then note that and Hence, for , by an application of the Chebyshev bound,

(4.1)

By an application of the union bound over all pairs of points:

And hence, if is a concatenation of over all , then with probability at least it holds for all ,:

The final length of the vectors is then for approximately preserving distances between every pair of points with probability at least . This can be reduced further to by simply applying the JL-Lemma.
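The sampling-based construction can be illustrated as follows. This is a minimal sketch under assumptions: the kernel density is the one assumed in the earlier illustration, frequencies are drawn by rejection sampling from a hyperbolic secant proposal (one convenient choice, not necessarily the one intended here), and the number of samples `s` is picked for illustration rather than from the bound above.

```python
import numpy as np

LN2 = np.log(2.0)

def sample_kappa_js(s, rng):
    """Draw s frequencies with density sech(pi w) / (ln 2 (1 + 4 w^2))."""
    out = []
    while len(out) < s:
        u = 1.0 - rng.uniform(size=s)                   # uniform on (0, 1]
        w = np.log(np.tan(np.pi * u / 2.0)) / np.pi     # inverse CDF of sech(pi w)
        keep = rng.uniform(size=s) < 1.0 / (1.0 + 4.0 * w ** 2)  # rejection step
        out.extend(w[keep])
    return np.array(out[:s])

def embed_js_random(p, freqs):
    """One cosine and one sine feature per (coordinate, sampled frequency) pair."""
    p = np.asarray(p, float)
    logp = np.log(np.where(p > 0, p, 1.0))              # dummy value where p_i = 0
    amp = np.sqrt(LN2 * p / freqs.size)[:, None]        # vanishes wherever p_i = 0
    phase = logp[:, None] * freqs[None, :]
    return np.concatenate([(amp * np.cos(phase)).ravel(),
                           (amp * np.sin(phase)).ravel()])

def js_divergence(p, q):
    m = (p + q) / 2.0
    def term(x):
        pos = x > 0
        return np.sum(x[pos] * np.log(x[pos] / m[pos]))
    return float(term(p) + term(q))

rng = np.random.default_rng(0)
freqs = sample_kappa_js(10000, rng)      # shared by every embedded distribution
p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.1, 0.4, 0.4, 0.1])
approx = float(np.sum((embed_js_random(p, freqs) - embed_js_random(q, freqs)) ** 2))
print(approx, js_divergence(p, q))
```

The same sampled frequencies must be used for every distribution, since the squared distance between two embeddings is an average over those shared frequencies.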

## 5 Embedding $\chi^2$ into $\ell_2^2$

We give here two algorithms for embedding the $\chi^2$ divergence into $\ell_2^2$. The computation and the resulting two algorithms are closely analogous to those of Section 4. First, the explicit formulation given by Vedaldi and Zisserman (2012) yields that for , :

For convenience, we now define:

and

We can then write where

It is easy to verify that is a distribution, i.e., .
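For the symmetrized $\chi^2$ distance, the corresponding representation can be written in closed form. The restatement below follows the homogeneous kernel map of Vedaldi and Zisserman (2012); the symbols $x$, $y$, $t$, and $\omega$ are choices made for this restatement.

```latex
\[
  \frac{(x - y)^2}{x + y}
  \;=\; \int_{-\infty}^{\infty} \operatorname{sech}(\pi\omega)\,
        \Bigl( x + y - 2\sqrt{xy}\,\cos\bigl(\omega \ln(x/y)\bigr) \Bigr)\, d\omega,
\]
\[
  \text{using}\quad
  \int_{-\infty}^{\infty} \operatorname{sech}(\pi\omega)\cos(\omega t)\, d\omega
  = \operatorname{sech}(t/2),
  \qquad
  2\sqrt{xy}\,\operatorname{sech}\!\Bigl(\tfrac{1}{2}\ln\tfrac{x}{y}\Bigr) = \frac{4xy}{x + y}.
\]
```

Setting $t = 0$ in the first identity of the second line confirms that $\operatorname{sech}(\pi\omega)$ integrates to one.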

### 5.1 Deterministic embedding

We will produce an embedding , where each is an integral that we discretize appropriately.

Input: where coordinates are ordered by arrival.
Output: A vector of length
; ,

for to
for to
for to

return concatenated with .
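The same truncate-and-quantize idea can be illustrated for the $\chi^2$ divergence. The code below is a sketch rather than a transcription of the algorithm above: it assumes the density $\operatorname{sech}(\pi\omega)$ and illustrative values for the truncation point `T` and step `h`.

```python
import numpy as np

def chi2_divergence(p, q):
    """Symmetrized chi-squared divergence: sum_i (p_i - q_i)^2 / (p_i + q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    s = p + q
    mask = s > 0
    return float(np.sum((p[mask] - q[mask]) ** 2 / s[mask]))

def embed_chi2(p, T=8.0, h=0.05):
    """Quantized feature map for chi-squared using the density sech(pi w)."""
    p = np.asarray(p, float)
    J = int(round(T / h))
    w = h * np.arange(-J, J + 1)                   # quantized frequencies in [-T, T]
    weight = np.sqrt(h / np.cosh(np.pi * w))       # sqrt of sech(pi w) * h
    logp = np.log(np.where(p > 0, p, 1.0))         # dummy value where p_i = 0
    amp = np.sqrt(p)[:, None] * weight[None, :]    # vanishes wherever p_i = 0
    phase = logp[:, None] * w[None, :]
    return np.concatenate([(amp * np.cos(phase)).ravel(),
                           (amp * np.sin(phase)).ravel()])

p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.1, 0.4, 0.4, 0.1])
approx = float(np.sum((embed_chi2(p) - embed_chi2(q)) ** 2))
print(approx, chi2_divergence(p, q))
```

Only the density changes relative to the JS case, which reflects how closely the two constructions parallel each other.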

For , we have and .

Similar to Section 4, the next two steps analyze truncating and quantizing the integral.

*Lemma (Truncation).* For ,

The first inequality follows since . For the second inequality, we use :

where the last line follows if .

Define for and . We recall the following lemma from Section 4:

*Lemma (Quantization).* For any ,

Given a real number , define vectors and indexed by where by:

and note that

Therefore,

where the second to last line follows from Lemma 5.1 and the last line follows from Lemma 5.1, since .

Define the vector to be the vector generated by concatenating and for . Then it follows that

Hence we have reduced the problem of estimating to estimation. Rescaling ensures the additive error is while the length of the vectors and is .

*Theorem.* Algorithm 5.1 embeds a set of points under $\chi^2$ into