Function space locality-sensitive hashing
We discuss the problem of performing similarity search over function spaces. To perform search over such spaces in a reasonable amount of time, we use locality-sensitive hashing (LSH). We present two methods that allow LSH functions on $\mathbb{R}^N$ to be extended to $L^p$ spaces: one using function approximation in an orthonormal basis, and another using (quasi-)Monte Carlo-style techniques. We use the presented hashing schemes to construct an LSH family for Wasserstein distance over one-dimensional, continuous probability distributions.
Similarity search over function spaces is an interesting but relatively unexplored problem. The reasons this type of similarity search remains unexplored are fairly straightforward: for one, most datasets encountered in applications are best thought of as consisting of discrete vectors, rather than continuous functions. Even in applications where the data are best modelled as elements of a function space, performing similarity search is very computationally intensive. Calculating just one similarity often requires an integral computation, potentially over a multidimensional domain.
Nonetheless, similarity search over function spaces is more than just a problem of theoretical interest: for instance, the Wasserstein metric, which may be defined over a function space, has applications in fields such as image search (Peleg et al., 1989). An intriguing recent application of similarity search over function spaces is its potential use as a heuristic in optimizing machine learning models; e.g., Chen et al. (2010) search over sets of weak learners generated using AdaBoost. Other applications of function space similarity search in this vein (e.g., comparing the features learned by neurons in a neural network) could prove a promising avenue for improving methods to train machine learning models in the future.
Our research seeks to make similarity search over function spaces a tractable problem by accelerating it with locality-sensitive hashing (LSH). In doing so, we can enable a wide range of applications involving similarity search over functions by dramatically reducing computational loads.
This paper makes the following contributions:
As a motivation for why we might be interested in performing similarity search in function spaces, we present the example of using the Wasserstein metric to compare probability distributions. In the case of one-dimensional distributions, we present an algorithm for hashing Wasserstein metrics of order $p$.
We discuss how, in general, hash functions for $\mathbb{R}^N$ can be extended to function spaces. We describe two methods for performing this extension:
In the specific (but common) case of $L^2$, we describe a method that uses function approximation in orthonormal bases to perform hashing.
More generally, we use Monte Carlo methods to create an approximate embedding of in in order to perform hashing. This method works for all .
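In this paper, we use $\Omega$ to denote a (measurable) subset of $\mathbb{R}^n$. $L^p_\mu(\Omega)$ signifies the function space, defining $\|f\|_{L^p_\mu} = \left( \int_\Omega |f|^p \, d\mu \right)^{1/p}$ (which is a norm if $p \ge 1$), over the measure space $(\Omega, \Sigma, \mu)$, where $\Sigma$ is a $\sigma$-algebra on $\Omega$. $L^p(\Omega)$ is used to imply that the measure is Lebesgue measure. In the special case of $p = 2$, the inner product is denoted $\langle \cdot, \cdot \rangle$ (and $\mu$ is implicit).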
Additionally, we let be the space of real sequences indexed by some set , whose norm is defined as ; is shorthand for . The inner product for the case of is denoted (and is implicit).
Finally, $W^p(f, g)$ is used to indicate the order-$p$ Wasserstein distance between probability distributions $f$ and $g$. The Wasserstein metric is defined in Section 2.2.
2.1 Locality-sensitive hashing
Locality-sensitive hashing is a method for accelerating similarity search (e.g. via $k$-nearest neighbors) that uses hash tables to reduce the number of queries that must be performed. Roughly, a family of hash functions is locality-sensitive for some similarity function if a hash function, drawn at random from the family, maps sufficiently similar inputs to the same hash with high probability, while keeping a small probability of a hash collision for sufficiently disparate inputs.
The idea behind LSH is that given a query with which to perform similarity search, we can reduce the size of our search space by only comparing our query against those elements of the database that experience a hash collision with the query. This can accelerate the process of performing similarity search by orders of magnitude, especially when our database is large.
To fine-tune the probability of a hash collision, one generally uses multiple hash functions sampled from the same LSH family simultaneously, replacing a single hash with a tuple of hashes. It is also common practice to use multiple hash tables simultaneously, so that a query and a database entry are treated as colliding if their hashes collide in at least one of the tables. With multi-probe LSH (Lv et al., 2007), one can further fine-tune collision probabilities by looking through buckets that don't necessarily correspond exactly to the query point's hash, but rather correspond to "nearby" hashes.
LSH families exist for a number of different similarity measures. Of interest in this paper are LSH families used for comparing vectors in $\mathbb{R}^N$, such as hash functions for cosine similarity (Charikar, 2002), $\ell^p$ distance for all $p \in (0, 2]$ (Datar et al., 2004), and inner product similarity (Shrivastava & Li, 2014, 2015).
2.2 Wasserstein metric
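The $p$-Wasserstein distance is a metric between a pair of probability distributions $f$ and $g$ on a metric space $(X, d)$. It is defined as
$$W^p(f, g) = \left( \inf_{\gamma \in \Gamma(f, g)} \int_{X \times X} d(x, y)^p \, d\gamma(x, y) \right)^{1/p}, \tag{1}$$
where $\Gamma(f, g)$ is the set of probability distributions on $X \times X$ with marginals $f$ and $g$.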
There is also a discrete analogue to (1) for representing discrete probability distributions over a set of points, where is formulated as the solution to the linear program:
where is the distance between points and , and is viewed as the “flow of mass” from to , analogous to in (1). Equations (1) and (2) are both deeply tied to optimal transport problems — in particular, is often referred to as the earth mover’s distance.
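For this paper, we will consider the special but nonetheless useful case of equation (1) where $X = \mathbb{R}$ and $d(x, y) = |x - y|$. In this case, the Wasserstein distance has the closed-form expression
$$W^p(f, g) = \left( \int_0^1 \left| F^{-1}(u) - G^{-1}(u) \right|^p du \right)^{1/p} = \left\| F^{-1} - G^{-1} \right\|_{L^p([0, 1])},$$
where $F$ and $G$ are the c.d.f.s of $f$ and $g$, respectively. This is valid for any $p \ge 1$ due to convexity of norms (Santambrogio, 2015, Prop. 2.17).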
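As a minimal sketch of how the closed-form expression can be evaluated numerically, the snippet below applies a midpoint rule on $(0, 1)$ to the difference of two inverse c.d.f.s. The function name `wasserstein_1d`, the choice of quadrature rule, and the exponential test distributions are illustrative assumptions and do not come from the paper.

```python
import math


def wasserstein_1d(inv_cdf_f, inv_cdf_g, p=1.0, n_nodes=20_000):
    """Approximate the order-p Wasserstein distance between two 1D
    distributions, given their inverse c.d.f.s as callables on (0, 1)."""
    total = 0.0
    for i in range(n_nodes):
        # Midpoint nodes avoid u = 0 and u = 1, where an inverse c.d.f.
        # may be infinite.
        u = (i + 0.5) / n_nodes
        total += abs(inv_cdf_f(u) - inv_cdf_g(u)) ** p
    return (total / n_nodes) ** (1.0 / p)


def exponential_inv_cdf(rate):
    """Inverse c.d.f. of an exponential distribution with the given rate."""
    return lambda u: -math.log1p(-u) / rate


# Illustrative inputs (not from the paper): Exp(1) versus Exp(2), order p = 1.
w1 = wasserstein_1d(exponential_inv_cdf(1.0), exponential_inv_cdf(2.0), p=1.0)
```

For these two exponential distributions the inverse c.d.f.s differ by a constant factor of two, so the order-1 distance is exactly 1/2, and `w1` should land close to that value.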
While computing via equation (1) is typically expensive, the simplified expression for one-dimensional Wasserstein distance is significantly more tractable. Nonetheless, it can still present computational problems for similarity search. For one, calculating (2.2) with quadrature rules can be expensive when we want to achieve low error. It is also often the case that we don’t have explicit representations for and , but rather samples of the underlying random variables and with those distributions. Approximating in this case is difficult using quadrature rules, especially if the number of samples of each random variable is different. The easiest way to approximate is to model and as step functions, but this approach may have relatively high numerical error. Moreover, computing still takes at least time (where and are the number of samples of and respectively), which can be painfully large if we have many samples of at least one of the random variables.
LSH offers a promising method to accelerate similarity search with Wasserstein distance. Our goal in Section 3 is to identify methods by which we can construct LSH families for similarities defined over function spaces, including 1D Wasserstein distance.
If we can construct an LSH family for the $L^p$ distance on the space $L^p([0, 1])$, then it is apparent from equation (2.2) that we can apply a function from that LSH family to the inverse c.d.f.s of two distributions to get a locality-sensitive hash for the Wasserstein distance between them.
2.3 Related work
There have previously been attempts to construct LSH families for the discrete analogue to Wasserstein distance in equation (2), although we are not aware of any attempts to use LSH for the continuous problem (1). Charikar (2002) applied a transformation to and such that the distance between them was bounded below by and above by . Indyk & Thaper (2003) presented a technique for approximately embedding in , over which they then constructed an LSH family. More generally, the -distance hash of Datar et al. (2004) can be used to generate an LSH family for any similarity function or metric space, provided that the similarity or metric can be approximately embedded in for .
Research into LSH for function spaces is relatively sparse. Tang et al. (2017), covering a hash function for cosine similarity between probability distributions, seems to be the clearest example of LSH over spaces of functions. Chen et al. (2010) also handles locality-sensitive hashing in function spaces by using LSH over the weak learners generated using AdaBoost. In both papers, the hash functions that are presented are fairly restrictive and only apply in some unique circumstances. In contrast, this paper contributes two different methods for constructing locality-sensitive hash functions on many different measures of similarity, over a much larger class of function spaces.
Our general approach is to create an embedding that preserves the distance between functions with minimal distortion. After achieving this, LSH functions for a variety of similarities (e.g. distance and cosine similarity) can be used to hash functions by hashing .
We present two methods in this vein that can be used to extend LSH functions for to function spaces:
In the special case of , we hash by approximating functions in an orthonormal basis in quasilinear time.
For the more general case of all (including the case where one does not have a sufficiently convenient orthonormal basis for the domain and measure ), we use (quasi-)Monte Carlo methods to embed in . We achieve or error in time linear in .
In both methods, the embedding has error inversely correlated with , with the guarantee that as the error converges to zero. We will see that can be increased as needed, so that we can achieve arbitrarily small error in our embeddings (and hence better hash functions).
3.1 Approximation in an orthonormal basis
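Start by considering the case of hashing elements of $L^2_\mu(\Omega)$. If $\{e_k\}_{k=1}^{\infty}$ is an orthonormal basis for $L^2_\mu(\Omega)$ (e.g. a wavelet basis), then the mapping from $L^2_\mu(\Omega)$ to $\ell^2$ given by $f \mapsto \left( \langle f, e_k \rangle \right)_{k=1}^{\infty}$ is a Hilbert space isomorphism between $L^2_\mu(\Omega)$ and $\ell^2$.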
Suppose we truncate this mapping for a function after terms. If is sufficiently large, then we have the approximation (to be made precise later)
where the right-hand side of the equation above is close to in -norm. Now let be some integer greater than , and define as
Then approximately preserves and in the case that :
As long as we choose for all functions in our dataset, then , as defined above, is an approximate embedding of in .
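The truncate-and-pad embedding can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: it swaps in the orthonormal cosine basis of $L^2([0, 1])$ (the experiments in Section 4 use Chebyshev polynomials), computes coefficients by a plain midpoint rule, and uses hypothetical names (`cosine_coeffs`, `embed`) and test functions.

```python
import math


def cosine_coeffs(f, n_terms, n_nodes=2048):
    """First n_terms coefficients <f, phi_k> in the orthonormal cosine basis
    of L^2([0, 1]): phi_0(x) = 1, phi_k(x) = sqrt(2) cos(k pi x) for k >= 1.
    Inner products are approximated with a midpoint rule."""
    xs = [(j + 0.5) / n_nodes for j in range(n_nodes)]
    fx = [f(x) for x in xs]
    coeffs = []
    for k in range(n_terms):
        scale = 1.0 if k == 0 else math.sqrt(2.0)
        s = sum(v * scale * math.cos(k * math.pi * x) for v, x in zip(fx, xs))
        coeffs.append(s / n_nodes)
    return coeffs


def embed(f, n_terms, big_n):
    """Truncate the expansion after n_terms coefficients, then zero-pad the
    coefficient vector to a common length big_n >= n_terms."""
    return cosine_coeffs(f, n_terms) + [0.0] * (big_n - n_terms)


def f(x):
    return math.exp(x)


def g(x):
    return x * x


# Different truncation lengths per function, padded to the same length 32.
tf = embed(f, 16, 32)
tg = embed(g, 12, 32)
embedded_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(tf, tg)))
```

Here `embedded_dist` approximates the $L^2([0, 1])$ distance between the two functions; keeping more coefficients per function tightens the approximation.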
Using orthonormal bases to compute hashes
To hash , we first map to , and then apply a locality-sensitive hash on for whatever similarity we are interested in. In theory, may be extremely large; however, we can use the fact that is zero in its last coefficients to significantly accelerate the process of hashing .
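As an example, we will consider the case of extending the $\ell^2$-distance hash of Datar et al. (2004) to hashing $L^2_\mu(\Omega)$. This hash is computed for a vector $x$ in $\mathbb{R}^N$ as
$$h(x) = \left\lfloor \frac{\langle \alpha, x \rangle}{r} + b \right\rfloor,$$
where $\alpha$ has i.i.d. entries randomly sampled from the standard normal distribution $\mathcal{N}(0, 1)$, $b \sim \mathrm{Uniform}([0, 1])$, and $r$ is a positive number chosen by the user. To extend this hash to $L^2_\mu(\Omega)$, we will simply hash the embedded coefficient vector of each function, since the embedding approximately preserves $L^2$ distances.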
Instead of generating all coefficients of the random vector up front (which may require a massive amount of memory, and requires us to place an upper bound on the embedding dimension), we lazily generate new coefficients whenever we encounter an input that retains more basis coefficients than the current length of the random vector. This approach is used in the pseudocode shown in Algorithm 1, which demonstrates the construction and usage of a locality-sensitive hash function for $L^2$ distance.
The sparsity pattern of the embedded vector makes computing its dot product with the random vector efficient. Since the embedded vector is zero in all of its trailing (padding) coefficients, only its leading nonzero coefficients contribute to the sum.
In addition, this sparsity pattern also means that we never need to know the full random vector. Instead, we can just append new randomly generated coefficients to it when we encounter an input with a new largest number of retained coefficients.
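The lazy generation scheme described above might look roughly like the following. This is a sketch and not a reproduction of the paper's Algorithm 1; the class name `LazyL2Hash`, the seeding scheme, and the convention that the offset is drawn from Uniform[0, 1) are assumptions made for illustration.

```python
import math
import random


class LazyL2Hash:
    """Hash floor(<alpha, x> / r + b) for l2 distance, where the Gaussian
    coefficient vector alpha is extended only when a longer input arrives."""

    def __init__(self, r, seed=0):
        self.r = r
        self._rng = random.Random(seed)
        self.b = self._rng.random()   # offset drawn once from Uniform[0, 1)
        self.alpha = []               # grows on demand

    def __call__(self, coeffs):
        # Append fresh N(0, 1) coefficients only if this input is longer
        # than every input seen so far.
        while len(self.alpha) < len(coeffs):
            self.alpha.append(self._rng.gauss(0.0, 1.0))
        # Zero padding beyond len(coeffs) contributes nothing to the dot
        # product, so only the leading coefficients are needed.
        dot = sum(a * c for a, c in zip(self.alpha, coeffs))
        return math.floor(dot / self.r + self.b)


h = LazyL2Hash(r=1.0)
short_hash = h([0.5, -0.2])             # alpha now has length 2
long_hash = h([0.1, 0.3, 0.0, 0.7, 0.2])  # alpha is extended to length 5
```

Because earlier entries of `alpha` are never regenerated, hashing the short input again after the extension gives the same value as before.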
Let and be the errors made by approximating and with a finite number of basis elements. We have the following bounds on the error induced by the embedding of in :
In other words, if and are chosen such that and are both size for some , then the absolute error of in approximating is . Meanwhile, the absolute error in approximating is .
We will use the $\ell^p$-distance hash from Datar et al. (2004) as an example for how this error can impact the probability of hash collision. The hash collision probability presented in that paper for two inputs $x$ and $y$ is
$$P(c) = \int_0^r \frac{1}{c} f_p\!\left( \frac{t}{c} \right) \left( 1 - \frac{t}{r} \right) dt,$$
where $r$ is a user-defined parameter, $c = \|x - y\|_{\ell^p}$, and $f_p$ is the p.d.f. of the absolute value of the underlying $p$-stably distributed random variable.
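To make the collision probability concrete, the sketch below evaluates the integral from Datar et al. (2004) for the case $p = 2$, where the 2-stable distribution is the standard normal and $f_2$ is the half-normal density. It compares a midpoint-rule evaluation with the closed form given in that paper. Function names and the node count are illustrative assumptions.

```python
import math


def half_normal_pdf(t):
    """p.d.f. of |Z| for Z ~ N(0, 1), evaluated at t >= 0."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * t * t)


def collision_prob_numeric(c, r, n_nodes=10_000):
    """Midpoint-rule evaluation of int_0^r (1/c) f_2(t/c) (1 - t/r) dt."""
    step = r / n_nodes
    total = 0.0
    for i in range(n_nodes):
        t = (i + 0.5) * step
        total += half_normal_pdf(t / c) / c * (1.0 - t / r)
    return total * step


def collision_prob_closed_form(c, r):
    """Closed form of the same integral for p = 2."""
    ratio = r / c
    phi_neg = 0.5 * (1.0 + math.erf(-ratio / math.sqrt(2.0)))  # Phi(-r/c)
    tail = 1.0 - math.exp(-0.5 * ratio * ratio)
    return 1.0 - 2.0 * phi_neg - 2.0 / (math.sqrt(2.0 * math.pi) * ratio) * tail


probs = [collision_prob_closed_form(c, r=1.0) for c in (0.5, 1.0, 2.0)]
```

The values in `probs` decrease as the distance `c` grows, which is the monotonicity property used in the bounds that follow.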
Suppose that and are such that and are both . Let (where is the -distance hash function for ) and let . Then we have the following bounds on the probability of a hash collision between and :
The hash collision probability is bounded above by
and below by
where is the collision probability when .
Proof of Theorem 3.1
The proof of these inequalities is just an application of Hölder's inequality. Since is monotone decreasing in , the hash collision probability is bounded above by for , and below by for . Thus we have the upper bound
and the lower bound
To compute the upper bound, we used the inequality
which is a result of Hölder's inequality and the fact that $f_p$ is a probability density function. If we instead apply Hölder's inequality as
we obtain a second upper bound on the collision probability,
Applying the same trick with the integral
leads to a second lower bound on the hash collision probability:
By combining both of the upper bounds and both of the lower bounds, we get the bounds shown in Theorem 3.1.
Note that these are fairly generous bounds – for instance, is generally much less than . Nonetheless, they demonstrate that approaches at a rate of at least or as .
Note on choosing and computing
There are two unaddressed issues in our previous discussion: (i) it is unclear how we choose for a function , and (ii) the inner products may be expensive to calculate, especially if we have to perform some kind of quadrature.
Choosing : in practice we will combine various heuristics to select a good for which we believe is a good approximation to . For instance, in Section 4 we use Chebyshev polynomials to perform function approximation. Although we choose fixed for demonstration purposes, Trefethen (2012) and Driscoll et al. (2014) both describe inequalities and heuristics that can be used to choose a good degree of Chebyshev polynomial (i.e. a good choice of ) to approximate a function. These bounds are often in terms of approximation in the uniform norm, thus for a bounded domain give a bound in the norm. In the case when is known or can be estimated, then can be explicitly computed (since is computable).
Computing the inner products: we will generally not compute the inner products of a function with the basis elements exactly, but rather sample the function at a finite set of points and compute some fast unitary transform on those samples to interpolate them in the chosen basis. For instance, as part of computing the Chebyshev polynomial coefficients used in Section 4, we perform a discrete cosine transform on the function sampled at certain nodes on the real line. With this approach we don't perfectly extract the coefficients, but we get good approximations to them that improve as the number of sample points grows. For the case of Chebyshev polynomials and smooth functions, this error often reaches very high precision with even a moderate number of sample points.
3.2 Monte Carlo methods for function LSH
Our second method for hashing functions generalizes to arbitrary function spaces of finite volume. It comes from the observation that by the theory of Monte Carlo integration,
In this expression, and , and is the volume of . The are sampled at random from under the probability measure . It can be shown similarly that . We can thus view the transform as an approximate embedding of in .
Naturally, we can extend this idea to the more general class of quasi-Monte Carlo methods to develop other schemes for constructing and . For instance, instead of sampling the points i.i.d. under the probability measure , we could sample them as a low-discrepancy sequence, e.g. as a Sobol sequence.
Using Monte Carlo to compute hashes
Since the transform is an approximate isomorphism between function space and when is sufficiently large, we can use many common hash functions for in by applying them to . We can summarize the hashing process in three steps:
Sample points at random from (with distribution dependent on the type of Monte Carlo method you wish to apply).
For a similarity of interest on , sample a new hash function from relevant LSH family.
When given a new function , sample it at through to generate the vector . Apply to .
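The three steps above can be sketched as follows for a one-dimensional interval with uniform sampling. This is an illustration under assumptions, not the authors' code: the helper names (`make_mc_embedding`, `make_l2_hash`), the scaling by $(V/N)^{1/p}$ with $V$ the interval length, and the example functions are all chosen here for demonstration.

```python
import math
import random


def make_mc_embedding(a, b, n_points, p=2.0, seed=0):
    """Step 1: draw n_points uniformly from [a, b] once and reuse them for
    every function, scaling samples by (V / N)^(1/p) with V = b - a."""
    rng = random.Random(seed)
    points = [rng.uniform(a, b) for _ in range(n_points)]
    scale = ((b - a) / n_points) ** (1.0 / p)

    def embed(f):
        # Step 3 (first half): sample the function at the shared points.
        return [scale * f(x) for x in points]

    return embed


def make_l2_hash(n, r, seed=1):
    """Step 2: draw one l2-distance hash, floor(<alpha, v> / r + b)."""
    rng = random.Random(seed)
    alpha = [rng.gauss(0.0, 1.0) for _ in range(n)]
    offset = rng.random()

    def h(v):
        return math.floor(sum(ai * vi for ai, vi in zip(alpha, v)) / r + offset)

    return h


embed = make_mc_embedding(0.0, 1.0, n_points=4096, p=2.0)
h = make_l2_hash(4096, r=1.0)

tf = embed(lambda x: math.sin(2.0 * math.pi * x))
tg = embed(lambda x: math.cos(2.0 * math.pi * x))
dist = math.sqrt(sum((u - v) ** 2 for u, v in zip(tf, tg)))
hash_f, hash_g = h(tf), h(tg)  # step 3 (second half): hash the embeddings
```

The true $L^2([0, 1])$ distance between these two functions is 1, so `dist` should be close to 1, with an error that shrinks as `n_points` grows.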
Suppose that the points are sampled with distribution from . For sufficiently large , is roughly normally distributed by the central limit theorem. This normal distribution has mean and variance
Meanwhile, for large the scaled inner product is also approximately normally distributed with mean and variance
These equations suggest that our error will be of order . Using quasi-Monte Carlo methods (i.e., by changing our sampling scheme so that we sample from a low-discrepancy sequence), we can achieve an error of Lemieux (2009) (where is the dimension of ), which may be significantly better than plain Monte Carlo in lower dimensions.
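As a small illustration of the difference between the two sampling schemes, the sketch below integrates a simple function on $[0, 1]$ using pseudo-random points and using a low-discrepancy sequence. It swaps in the base-2 van der Corput sequence because it is easy to implement in one dimension; the text mentions Sobol sequences, which are not implemented here. The integrand and sample count are illustrative assumptions.

```python
import random


def van_der_corput(i, base=2):
    """i-th element (i >= 1) of the van der Corput sequence: reverse the
    base-`base` digits of i about the radix point."""
    q, denom = 0.0, 1.0
    while i > 0:
        denom *= base
        i, digit = divmod(i, base)
        q += digit / denom
    return q


def integrand(x):
    return x * x  # exact integral over [0, 1] is 1/3


n = 1024

# Plain Monte Carlo: i.i.d. uniform points.
rng = random.Random(0)
mc_estimate = sum(integrand(rng.random()) for _ in range(n)) / n

# Quasi-Monte Carlo: deterministic low-discrepancy points.
qmc_estimate = sum(integrand(van_der_corput(i)) for i in range(1, n + 1)) / n

mc_error = abs(mc_estimate - 1.0 / 3.0)
qmc_error = abs(qmc_estimate - 1.0 / 3.0)
```

The first few van der Corput points are 0.5, 0.25, 0.75, which shows how the sequence fills the interval evenly instead of clustering.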
4 Numerical experiments
To validate the methods described in Section 3, we ran the following numerical experiments:
measuring hash collision rates for function LSH over cosine similarity;
measuring hash collision rates for function LSH over distance; and
observing the effectiveness of using function LSH for -Wasserstein distance, using the distance formulation of 1D Wasserstein distance in equation (2.2).
We find that in all three experiments, the observed collision rates track closely with the theoretical collision probabilities for the hash function that we are extending from to .
For the function approximation method, we used the Chebyshev polynomial basis (which, with a change of variables, can be made a basis for with Lebesgue measure). For both methods, we generated 1,024 hash functions in order to measure the average collision probability between a given pair of inputs. We converted each function to a vector in using the two methods described in Section 3 before hashing them in order to make it easier to compare the effectiveness of both methods. For both methods, this essentially amounts to sampling each function in 64 different locations.
For all experiments, we take . In the second and third experiments, which use the -distance hash, we choose the hyperparameter (from Equation (5)) to be for demonstration purposes.
LSH over cosine similarity
For our first experiment, we used both of our function hashing methods on pairs of randomly generated sine functions $f$ and $g$ with randomly drawn parameters, since in this parametric form the true value of the cosine similarity between $f$ and $g$ can be computed via a closed-form integral. After converting $f$ and $g$ into vectors in $\mathbb{R}^N$ using the two methods described in Section 3, we hash them using SimHash (Charikar, 2002), whose collision probability is
$$\Pr[h(x) = h(y)] = 1 - \frac{\theta(x, y)}{\pi}, \qquad \theta(x, y) = \arccos\left( \frac{\langle x, y \rangle}{\|x\| \, \|y\|} \right).$$
This theoretical probability is plotted against the observed collision frequencies in Figure 1.
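A comparison of this kind can be sketched on plain vectors as follows. The snippet draws 1,024 SimHash functions (random hyperplanes) and compares the observed collision frequency with the theoretical value $1 - \theta/\pi$. The helper names and the two-dimensional test vectors are assumptions for illustration; the experiment in the paper hashes embedded functions instead.

```python
import math
import random


def make_simhash(n, rng):
    """One SimHash function: the sign of a random Gaussian projection."""
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]

    def h(v):
        return sum(wi * vi for wi, vi in zip(w, v)) >= 0.0

    return h


def simhash_collision_prob(x, y):
    """Theoretical collision probability 1 - theta / pi."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_sim = max(-1.0, min(1.0, dot / (norm_x * norm_y)))  # guard rounding
    return 1.0 - math.acos(cos_sim) / math.pi


x = [1.0, 0.0]
y = [1.0, 1.0]  # angle of pi / 4 with x

rng = random.Random(0)
hashes = [make_simhash(2, rng) for _ in range(1024)]
observed = sum(h(x) == h(y) for h in hashes) / len(hashes)
theoretical = simhash_collision_prob(x, y)
```

For these two vectors the theoretical collision probability is 0.75, and `observed` should fall near it up to sampling noise.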
LSH over distance
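For our second experiment, we again sample pairs of random sine waves $f$ and $g$ and use the function approximation- and Monte Carlo-based methods to convert the functions to vectors in $\mathbb{R}^N$. The collision probability for the $\ell^2$-distance hash of Datar et al. (2004) is
$$P(c) = \int_0^r \frac{1}{c} f_2\!\left( \frac{t}{c} \right) \left( 1 - \frac{t}{r} \right) dt,$$
where $c = \|x - y\|_{\ell^2}$, $f_2$ is the p.d.f. of the absolute value of a standard normal random variable, and $r$ is a user-selected parameter. It follows that when we apply this hash to our vectors in $\mathbb{R}^N$, we expect their collision probability to follow the same distribution (except with $c$ replaced by $\|f - g\|_{L^2}$). This is borne out by the observed collision rates shown in Figure 2.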
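For our third experiment, we compare pairs of one-dimensional normal distributions on their second-order Wasserstein distance. We choose to measure the distance between normal distributions because every pair of Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ (with means $\mu_1$ and $\mu_2$ and covariance matrices $\Sigma_1$ and $\Sigma_2$) has the following convenient closed-form expression for $W^2$ (Olkin & Pukelsheim, 1982):
$$\left[ W^2\!\left( \mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2) \right) \right]^2 = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\!\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2} \right)^{1/2} \right).$$
For a pair of 1D Gaussians $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$, this reduces to
$$\left[ W^2\!\left( \mathcal{N}(\mu_1, \sigma_1^2), \mathcal{N}(\mu_2, \sigma_2^2) \right) \right]^2 = (\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2.$$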
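The 1D closed form can be checked against a sampled embedding of the inverse c.d.f.s, along the lines of the following sketch. It samples each inverse c.d.f. at 64 equispaced midpoints of $(0, 1)$, which is a simple deterministic stand-in chosen here and not one of the two sampling schemes used in the experiments. The function names and Gaussian parameters are likewise assumptions for illustration.

```python
import math
from statistics import NormalDist


def w2_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """Closed-form order-2 Wasserstein distance between 1D Gaussians."""
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)


def embed_inverse_cdf(dist, n_points=64):
    """Sample the inverse c.d.f. at midpoints of (0, 1), scaled by
    1 / sqrt(n_points) so that l2 distances approximate L^2([0, 1])
    distances."""
    scale = 1.0 / math.sqrt(n_points)
    return [scale * dist.inv_cdf((i + 0.5) / n_points) for i in range(n_points)]


f = NormalDist(mu=0.0, sigma=1.0)
g = NormalDist(mu=1.0, sigma=2.0)

vf = embed_inverse_cdf(f)
vg = embed_inverse_cdf(g)
embedded = math.sqrt(sum((a - b) ** 2 for a, b in zip(vf, vg)))
exact = w2_gaussian_1d(f.mean, f.stdev, g.mean, g.stdev)
```

The embedded distance slightly underestimates the exact value because the finite grid does not capture the unbounded tails of the inverse c.d.f.s near 0 and 1. Any $\ell^2$-distance hash applied to `vf` and `vg` then acts as an approximate hash for the Wasserstein distance.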
For our experiment, we repeatedly generated pairs of Gaussians, each with means randomly sampled from and variances sampled from . To hash the distributions, we used the expression in Equation (2.2) by hashing
Similarity search over spaces of functions is a very computationally intensive task. Our study has extended multiple locality-sensitive hash functions from $\mathbb{R}^N$ to the much more general $L^p$ function spaces. These methods can be made arbitrarily precise (i.e., we can get arbitrarily close to the collision probabilities guaranteed by the LSH families in $\mathbb{R}^N$) in exchange for a little more computational effort. As a result, the function hashing techniques described in this paper make the problem of similarity search over function spaces significantly more tractable.
Although we have primarily discussed the cosine similarity hash of Charikar (2002) and the distance hash of Datar et al. (2004), the methods presented in this paper can in theory be used to extend any hash function for a similarity over that has an analogous definition in . Of particular interest are the hash functions for maximum inner product search of Shrivastava & Li (2014) and Shrivastava & Li (2015). Such a hash function could be used as a primitive in defining hash functions for other similarities. For instance, similarity search based on KL divergence can be re-expressed as a maximum inner product search problem, based on the fact that
where the proportionality coefficient is constant for fixed .
The techniques described in this paper can also be applied in broader input spaces than . The function approximation-based approach of Section 3.1 can be used to hash any separable Hilbert space in which we have identified an orthonormal basis (or, at the very least, can implicitly compute the inner products ). In addition, the Monte Carlo approach can be used on arbitrary sets of functions defined over any finite-volume measure space (including those for which , so long as we have a way of sampling functions in this space).
- The inverse c.d.fs are at and at , so we experience some numerical difficulties trying to approximate them by Chebyshev polynomials. To avoid this issue, we only hashed the portion of the inverse c.d.f. living on the interval (instead of ), which empirically still performed well in generating a frequency of hash collisions close to the theoretical probability of collision.
- A closed-form expression for these inverse c.d.fs does not exist, but this is not an issue because in our experiments we only need to be able to sample these c.d.fs at 64 points in order to hash them.
- Charikar, M. S. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, pp. 380–388, New York, NY, USA, 2002. Association for Computing Machinery. doi: 10.1145/509907.509965.
- Chen, S., Wang, J., Liu, Y., Xu, C., and Lu, H. Fast feature selection and training for AdaBoost-based concept detection with large scale datasets. In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, pp. 1179–1182, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781605589336. doi: 10.1145/1873951.1874181.
- Datar, M., Indyk, P., Immorlica, N., and Mirrokni, V. Locality-sensitive hashing scheme based on p-stable distributions. 01 2004. doi: 10.1145/997817.997857.
- Driscoll, T. A., Hale, N., and Trefethen, L. N. Chebfun guide, 2014.
- Indyk, P. and Thaper, N. Fast image retrieval via embeddings. 2003.
- Lemieux, C. Monte Carlo and Quasi-Monte Carlo Sampling. Springer, 2009. ISBN 978-1441926760.
- Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. pp. 950–961, 01 2007.
- Olkin, I. and Pukelsheim, F. The distances between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48:257–263, 12 1982. doi: 10.1016/0024-3795(82)90112-4.
- Peleg, S., Werman, M., and Rom, H. A unified approach to the change of resolution: Space and gray-level. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 11:739–742, 08 1989. doi: 10.1109/34.192468.
- Santambrogio, F. Optimal transport for applied mathematicians. Birkhäuser, NY, 55(58-63):94, 2015.
- Shrivastava, A. and Li, P. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp. 2321–2329, Cambridge, MA, USA, 2014. MIT Press.
- Shrivastava, A. and Li, P. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI'15, pp. 812–821, Arlington, Virginia, USA, 2015. AUAI Press. ISBN 9780996643108.
- Tang, Y.-K., Mao, X.-L., Hao, Y.-J., Xu, C., and Huang, H. Locality-sensitive hashing for finding nearest neighbors in probability distributions. pp. 3–15, 10 2017. ISBN 978-981-10-6804-1. doi: 10.1007/978-981-10-6805-8_1.
- Trefethen, L. N. Approximation Theory and Approximation Practice (Other Titles in Applied Mathematics). Society for Industrial and Applied Mathematics, USA, 2012. ISBN 1611972396.