2-Bit Random Projections, NonLinear Estimators, and Approximate Near Neighbor Search


Abstract

The method of random projections has become a standard tool for machine learning, data mining, and search with massive data at Web scale. The effective use of random projections requires efficient coding schemes for quantizing (real-valued) projected data into integers. In this paper, we focus on a simple 2-bit coding scheme. In particular, we develop accurate nonlinear estimators of data similarity based on the 2-bit strategy. This work has important practical applications. For example, in the task of near neighbor search, a crucial step (often called re-ranking) is to compute or estimate data similarities once a set of candidate data points has been identified by hash table techniques. This re-ranking step can take advantage of the proposed coding scheme and estimator.

As a related task, in this paper, we also study a simple uniform quantization scheme for the purpose of building hash tables with projected data. Our analysis shows that typically only a small number of bits are needed. For example, when the target similarity level is high, 2 or 3 bits might be sufficient. When the target similarity level is not so high, it is preferable to use only 1 or 2 bits. Therefore, a 2-bit scheme appears to be overall a good choice for the task of sublinear time approximate near neighbor search via hash tables.

Combining these results, we conclude that 2-bit random projections should be recommended for approximate near neighbor search and similarity estimation. Extensive experimental results are provided.




Ping Li
Department of Statistics
Dept. of Computer Science
Rutgers University
Piscataway, NJ 08854, USA

pingli@stat.rutgers.edu
Michael Mitzenmacher
School of Engineering and Applied Sciences
Harvard University
Cambridge, MA 02138, USA

michaelm@eecs.harvard.edu
Anshumali Shrivastava
Dept. of Computer Science
Rice University
Houston, TX 77005, USA

anshumali@rice.edu



Computing (or estimating) data similarities is a fundamental task in numerous practical applications. The popular method of random projections provides a potentially effective strategy for estimating data similarities (correlation or Euclidean distance) in massive high-dimensional datasets, in a memory-efficient manner. Approximate near neighbor search is a typical example of those applications.

The task of near neighbor search is to identify a set of data points which are “most similar” (in some measure of similarity) to a query data point. Efficient algorithms for near neighbor search have numerous applications in search, databases, machine learning, recommender systems, computer vision, etc. Developing efficient algorithms for finding near neighbors has been an active research topic since the early days of modern computing [?]. Near neighbor search with extremely high-dimensional data (e.g., texts or images) is still a challenging task and an active research problem.

In the specific setting of the World Wide Web, the use of hashing and random projections for applications such as detection of near-duplicate Web pages dates back to (e.g.,) [?, ?]. The work in this area has naturally continued, improved, and expanded; see, for example,  [?, ?, ?, ?, ?, ?, ?, ?, ?, ?] for research papers with newer results on the theoretical frameworks, performance, and applications for such methods. In particular, such techniques have moved beyond near-duplicate detection and retrieval to detection and retrieval for more complex data types, including images and videos. Our work continues on this path; specifically, we seek to obtain accurate similarity scores using very small-memory random projections, for applications where the goal is to determine similar objects, or equivalently nearest neighbors in a well-defined space.

Among many types of similarity measures, the (squared) Euclidean distance (denoted by $d$) and the correlation (denoted by $\rho$) are most commonly used. Without loss of generality, consider two high-dimensional data vectors $u, v \in \mathbb{R}^D$. The squared Euclidean distance and correlation are defined as follows:
$$d = \sum_{i=1}^{D} (u_i - v_i)^2, \qquad \rho = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\,\sqrt{\sum_{i=1}^{D} v_i^2}}.$$

The correlation $\rho$ is conveniently normalized between -1 and 1. For convenience, this study assumes that the marginal norms $\sum_{i=1}^{D} u_i^2$ and $\sum_{i=1}^{D} v_i^2$ are known. This is an often reasonable assumption [?], as computing the marginal norms only requires scanning the data once, which is needed anyway during the data collection process. In machine learning practice, it is common to normalize the data before feeding them to classification (e.g., SVM) or clustering (e.g., K-means) algorithms. Therefore, for convenience, throughout this paper we assume unit norms:
$$\sum_{i=1}^{D} u_i^2 = \sum_{i=1}^{D} v_i^2 = 1, \qquad \text{so that} \quad d = 2(1-\rho).$$

As an effective tool for dimensionality reduction, the idea of random projections is to multiply the data, e.g., $u \in \mathbb{R}^D$, with a random normal projection matrix $\mathbf{R} \in \mathbb{R}^{D \times k}$ (whose entries are i.i.d. standard normal), to generate:
$$x = u \times \mathbf{R} \in \mathbb{R}^{k}.$$

This method has become popular for large-scale machine learning applications such as classification, regression, matrix factorization, singular value decomposition, near neighbor search, bio-informatics, etc. [?, ?, ?, ?, ?, ?, ?, ?, ?].
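To make the setup concrete, here is a brief sketch (our own illustration in Python/NumPy, not code from the paper) that generates normal random projections for two unit-norm vectors; the inner product of the projected vectors, divided by the number of projections, gives the familiar estimate of $\rho$ from the full-precision projections.

```python
import numpy as np

rng = np.random.default_rng(0)

D, k = 10_000, 256                 # original dimension, number of projections
u = rng.standard_normal(D)
v = 0.8 * u + 0.6 * rng.standard_normal(D)   # a vector correlated with u

# Normalize to unit norm, as assumed throughout the paper.
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)
rho = float(u @ v)                 # true correlation (cosine similarity)

R = rng.standard_normal((D, k))    # projection matrix with i.i.d. N(0,1) entries
x = u @ R                          # projected data for u, shape (k,)
y = v @ R                          # projected data for v, shape (k,)

rho_hat_full = float(x @ y) / k    # estimate from the real-valued projections
print(rho, rho_hat_full)
```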

The projected data ($x$, $y$) are real-valued. For many applications it is however crucial to quantize them into integers. The quantization step is in fact mandatory if the projected data are used for the purpose of indexing and/or sublinear time near neighbor search, e.g., in the framework of locality sensitive hashing (LSH) [?].

Another strong motivation for quantization is to reduce memory consumption. If only a few (e.g., 2) bits suffice to produce accurate estimates of the similarity, then we do not need to store the entire (e.g., 32- or 64-bit) real-valued projected data. This yields very significant savings in storage as well as computation.

In this paper, we focus on 2-bit coding and estimation for multiple reasons. As analyzed later in the paper, 2-bit coding appears to provide an overall good scheme for building hash tables in near neighbor search. The focus of this paper is on developing accurate nonlinear estimators, which are typically computationally expensive; fortunately, for 2-bit coding it is still feasible to find the numerical solution fairly easily, for example, by tabulation.

Given two (high-dimensional) data vectors $u, v \in \mathbb{R}^D$, we generate two projected values $x$ and $y$ as follows:
$$x = \sum_{i=1}^{D} u_i r_i, \qquad y = \sum_{i=1}^{D} v_i r_i, \qquad r_i \sim N(0,1) \;\text{i.i.d.}$$

Assuming that the original data $u$, $v$ are normalized to unit norm, the projected data $(x, y)$ follow a bivariate normal distribution:

$$\begin{pmatrix} x \\ y \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right). \qquad (1)$$

Note that when using random projections in practice, we will need $k$ independent projections, with $k$ depending on the application; we will use $x_j$, $y_j$, $j = 1$ to $k$, to denote them.

As the projected data are real-valued, we will have to quantize them, either for indexing or for achieving compact storage. The figure below pictures the 2-bit coding scheme applied after random projections. Basically, a random projection value is mapped to one of four integers according to a threshold $w$ (and $-w$).

Figure: 2-bit random projections.
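Continuing the sketch above, the following shows one way the 2-bit coding map could be implemented. The particular assignment of the integers 0-3 to the four intervals is our own convention (the paper's figure may label the regions differently), but any fixed relabeling carries the same information.

```python
import numpy as np

def code_2bit(x, w):
    """Map real-valued projections to 2-bit codes in {0, 1, 2, 3},
    using cut points at -w, 0, +w (codes assigned left to right)."""
    x = np.asarray(x)
    return (x >= -w).astype(int) + (x >= 0) + (x >= w)

w = 0.75                 # an illustrative threshold, not a prescribed value
cx = code_2bit(x, w)     # x, y are the projections from the previous sketch
cy = code_2bit(y, w)

# Joint 4x4 contingency counts over the k projections; by the symmetry
# discussed in the text, these collapse to a small number of distinct cells.
counts = np.zeros((4, 4), dtype=int)
np.add.at(counts, (cx, cy), 1)
print(counts)
```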

As shown in the figure, the $(x, y)$ plane is divided into 16 regions according to the pre-determined threshold $w$. To fully exploit the information, we need to jointly analyze the probabilities of all 16 regions. We will see that the analysis is quite involved.

The first step of the analysis is to compute the probability of each region. Fortunately, due to symmetry (and asymmetry), we just need to conduct the computations for three regions:

Due to symmetry, the probabilities of other regions are simply

We use the following standard notation for the normal distribution pdf $\phi$ and cdf $\Phi$:
$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}, \qquad \Phi(x) = \int_{-\infty}^{x} \phi(t)\, dt.$$

After some tedious calculations (which are skipped), the probabilities of the three regions are

Their first derivatives (with respect to ) are

Their second derivatives are

Because $\rho$ is bounded, we can tabulate the above probabilities and their derivatives for the entire range of $\rho$ and for selected $w$ values. Note that in practice we have to specify a $w$ in advance anyway. In other words, computing the probabilities and derivatives becomes a simple matter of efficient table look-ups.

Suppose we use in total $k$ projections. Due to symmetry (as shown in the figure), the log-likelihood $l(\rho)$ is a sum of 6 terms (6 cells), each of the form (number of observations in the cell) $\times$ $\log$ (probability of the cell).

Corresponding to the figure, $n_{1,1}$ is the number of observations (among the $k$ observations) falling in region (1,1); $n_{0,2}$, etc., are defined similarly. Note that there is a natural constraint: the cell counts must sum to $k$.

In other words, this 6-cell problem only has 5 degrees of freedom. In fact, we can also choose to collapse some cells together to reduce this to an even smaller problem. For example, later we will show that if we reduce the 6-cell problem to a 5-cell problem, the estimation accuracy will not be affected much.

There is more than one way to solve the MLE, i.e., to find the $\rho$ which maximizes the log-likelihood $l(\rho)$. Note that this is merely a one-dimensional optimization problem (at a fixed $w$), and we can tabulate all the probabilities (and their derivatives). In other words, it is not a difficult problem. We can use binary search, gradient descent, Newton's method, etc. Here we provide the first and second derivatives of $l(\rho)$. The first derivative is

and the second derivative is

If we use Newton's method, we can find the solution iteratively via $\rho^{(t+1)} = \rho^{(t)} - l'(\rho^{(t)})/l''(\rho^{(t)})$, starting from a good initial guess, e.g., the estimate using 1-bit information. Normally a small number of iterations will be sufficient. Recall that these first and second derivatives can be pre-computed and stored in look-up tables.
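As a rough numerical illustration of this procedure (our sketch, not the paper's implementation), the code below computes the 16 cell probabilities directly from the bivariate normal CDF and maximizes the log-likelihood over a grid of $\rho$ values. A production implementation would pre-tabulate these probabilities and use Newton's method as described above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cell_probs(rho, w):
    """4x4 matrix of probabilities that (x, y) ~ N(0, [[1, rho], [rho, 1]])
    falls in each pair of coding intervals with cut points at -w, 0, +w."""
    cuts = np.array([-np.inf, -w, 0.0, w, np.inf])
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

    def F(a, b):
        # Joint CDF Pr[X <= a, Y <= b]; zero if either bound is -infinity.
        if a == -np.inf or b == -np.inf:
            return 0.0
        return float(mvn.cdf([min(a, 1e6), min(b, 1e6)]))

    P = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            a1, b1 = cuts[i], cuts[i + 1]
            a2, b2 = cuts[j], cuts[j + 1]
            # Rectangle probability via inclusion-exclusion on the joint CDF.
            P[i, j] = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)
    return np.clip(P, 1e-12, 1.0)

def mle_rho(counts, w, grid=np.linspace(-0.99, 0.99, 199)):
    """Grid-search MLE of rho from the 4x4 contingency counts."""
    loglik = [np.sum(counts * np.log(cell_probs(r, w))) for r in grid]
    return grid[int(np.argmax(loglik))]

print(mle_rho(counts, w))   # counts, w from the previous sketch
```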

For this particular 2-bit coding scheme, it is possible to completely avoid the numerical procedure by further exploiting look-up table tricks. Suppose we tabulate the MLE results over a fine grid of cell-frequency combinations (e.g., spaced at 0.01). Then a 6-cell scheme would only require a table of manageable size (recall there are only 5 degrees of freedom), and a 5-cell scheme would reduce the table size further. Of course, if we hope to use more than 2 bits, then we cannot avoid numerical computations.

The asymptotic (for large $k$) variance of the MLE (i.e., the $\rho$ which maximizes the log-likelihood $l(\rho)$) can be computed from classical statistical estimation theory. Denote the MLE by $\hat{\rho}_{2bit}$. Then its asymptotic variance should be

$$\mathrm{Var}\left(\hat{\rho}_{2bit}\right) = \frac{1}{k\, I_{2bit}} + O\!\left(\frac{1}{k^2}\right), \qquad (2)$$

where $I_{2bit}$ is the Fisher Information.

Theorem 1

The Fisher Information is

(3)

Proof: We need to compute $E\left(-\frac{\partial^2 l}{\partial \rho^2}\right)$. Because the expected count in each cell equals $k$ times the cell probability, the expression simplifies substantially, and the stated result follows after algebra.

While the expressions appear sophisticated, the Fisher Information and variance can be verified by simulations; see the mean square error simulation figure presented later.

A linear estimator uses only the information of whether the code of $x$ equals the code of $y$. In other words, linear estimators use only the diagonal information in the figure. With a 2-bit scheme, $\rho$ can be estimated from the counts in the collapsed (diagonal) cells, by equating the observed frequency of code agreement with its theoretical collision probability and solving for $\rho$,

which still requires a numerical procedure (or tabulation). The analysis of the linear estimator was done in [?], and it can also be inferred from the analysis of the nonlinear estimator in this paper.

This special case can be derived from the results of 2-bit random projections by letting the threshold degenerate so that only the signs of the projections are retained. The estimator, obtained by counting the observations in each quadrant, has a simple closed form [?, ?], i.e., $\hat{\rho}_{1bit} = \cos\!\left(\pi\left(1 - \frac{C}{k}\right)\right)$, where $C$ is the number of projections on which the signs of $x$ and $y$ agree. The Fisher Information of this estimator, denoted by $I_{1bit}$, can then be derived accordingly.
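For reference, here is a minimal sketch of the 1-bit (sign-based) estimator, using the standard sim-hash collision probability $\Pr[\mathrm{sign}(x) = \mathrm{sign}(y)] = 1 - \frac{1}{\pi}\cos^{-1}\rho$; the function name is ours.

```python
import numpy as np

def rho_1bit(x, y):
    """1-bit (sign / sim-hash) estimator of the correlation rho:
    invert the collision probability at the observed sign-agreement rate."""
    agree = np.mean(np.sign(x) == np.sign(y))
    return np.cos(np.pi * (1.0 - agree))

print(rho_1bit(x, y))   # x, y from the first sketch
```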

The ratio

$$\frac{I_{2bit}}{I_{1bit}} \qquad (4)$$

characterizes the reduction of variance achieved by using the 2-bit scheme and the MLE, as a function of $w$ and $\rho$.

We provide the following theorem, showing that the ratio is close to 2 when $\rho = 0$. Later we will see that, for high similarity regions, the ratio can be substantially higher than 2.

Theorem 2

For (4) at $\rho = 0$, we have,

(5)

The figure below shows that the function in (5) has a unique maximum of 1.3863 (i.e., the maximum of the ratio is 1.9218), attained at a particular value of $w$.

Figure: The curve of the function defined in (5).

The performance depends on $w$ (and $\rho$). In practice, we need to pre-specify a value of $w$ for the random projections, and we have to use the same $w$ for all data points because this coding process is non-adaptive. The next two figures plot the ratio (left panels) for selected $\rho$ values, confirming that a moderate $w$ should be an overall good choice. In addition, the right panels of these two figures show that if we collapse some cells appropriately (from a 6-cell model to a 5-cell model), the performance does not degrade much (not at all in the high similarity region, which is often more interesting in practice).

According to the 2-bit coding figure, we collapse the three cells (0,3), (0,2), and (1,3) into one cell. Note that (0,2) and (1,3) have the same probabilities and are already treated as one cell. Due to symmetry, the other three cells (3,0), (2,0), and (3,1) are also collapsed into one. This way, we have in total 5 distinct cells. The intuition is that, if we are mostly interested in highly similar regions, most of the observations will fall near the diagonal. This treatment simplifies the estimation process and does not lead to an obvious degradation of accuracy, at least for high similarity regions, according to the two figures below.

Figure: The ratio (4) at selected $\rho$ values, which characterizes the improvement of the MLE over the 1-bit estimator. A moderate $w$ appears to provide an overall good trade-off. The problem is a 6-cell (i.e., left panel) contingency table estimation problem. To demonstrate the simplification of the process by using 5 cells (see the main text for the description of the procedure), we also include the same type of improvements for the reduced 5-cell model in the right panel.

Figure: The ratio (4) at additional $\rho$ values, showing that a moderate $w$ is an overall good trade-off. There is not enough space to label every curve, but the order of the curves should be a good indicator; one reference curve is plotted in red, if color is available.

Figure: Mean square errors (MSE) from the simulations to verify the nonlinear MLE. The empirical MSEs essentially overlap the asymptotic variances predicted by the Fisher information (3), confirming the theoretical results. In addition, we also plot the empirical MSEs of the 1-bit estimator to verify the substantial improvement of the MLE.

As presented in the MSE figure above, a simulation study is conducted to confirm the theoretical results of the MLE, for a wide range of $\rho$ values. The plots confirm that the MLE substantially improves on the 1-bit estimator, even at low similarities. They also verify that the theoretical asymptotic variance predicted by the Fisher Information (3) is accurate, essentially no different from the empirical mean square errors. We hope this experiment may help readers who are less familiar with the classical theory of Fisher Information.

In this section, we review two common coding strategies: (i) the scheme based on windows plus a random offset; (ii) the scheme based on simple uniform quantization. Note that both of them are, strictly speaking, infinite-bit coding schemes, although (ii) can be effectively viewed as a finite-bit scheme.

[?] proposed the following well-known coding scheme, which uses windows and a random offset:

$$h_{w,q}(x) = \left\lfloor \frac{x + q}{w} \right\rfloor, \qquad (6)$$

where $q \sim \mathrm{Uniform}(0, w)$, $w$ is the bin width, and $\lfloor \cdot \rfloor$ is the standard floor operation. [?] showed that the collision probability $\Pr\left[h_{w,q}(x) = h_{w,q}(y)\right]$ can be written as a monotonic function of the Euclidean distance between $u$ and $v$.

A simpler (and in fact better) scheme than (6) is based on uniform quantization without an offset:

$$h_{w}(x) = \left\lfloor \frac{x}{w} \right\rfloor. \qquad (7)$$

The collision probability for (7), $P_w = \Pr\left[h_w(x) = h_w(y)\right]$, is a monotonically increasing function of the similarity $\rho$.

The fact that $P_w$ is monotonically increasing in $\rho$ makes (7) an appropriate coding scheme for approximate near neighbor search under the general framework of locality sensitive hashing (LSH). Note that, while the exact expression for $P_w$ appears sophisticated, it is needed only for the analysis. Without the offset, the scheme (7) itself is operationally simpler than the popular scheme (6).
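For concreteness, here is a small sketch of both hash functions (our illustration); the uniform-quantization code (7) simply drops the random offset used by (6).

```python
import numpy as np

rng = np.random.default_rng(1)

def h_offset(x, w, q):
    """Scheme (6): uniform quantization with a random offset q ~ Uniform(0, w)."""
    return np.floor((x + q) / w).astype(int)

def h_uniform(x, w):
    """Scheme (7): plain uniform quantization, no offset."""
    return np.floor(x / w).astype(int)

w = 0.75                     # illustrative bin width
q = rng.uniform(0, w)        # one offset, shared by the projections of u and v

# x, y are the projected data from the earlier sketch.
print(np.mean(h_offset(x, w, q) == h_offset(y, w, q)))  # empirical collision rate, scheme (6)
print(np.mean(h_uniform(x, w) == h_uniform(y, w)))      # empirical collision rate, scheme (7)
```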

In the prior work, [?] studied the coding scheme (7) in the context of similarity estimation using linear estimators with application to building large-scale linear classifiers. In this paper, we conduct the study of (7) for sublinear time near neighbor search by building hash tables from coded projected data. This is a very different task from similarity estimation. Moreover, much of the space of the paper is allocated to the design and analysis of nonlinear estimators which are very useful in the “re-ranking” stage of near neighbor search after the potentially similar data points are retrieved.

There is another important distinction between (7) and (6). By using a window and a random offset, (6) is actually an “infinite-bit” scheme. On the other hand, with only a uniform quantization, (7) is essentially a finite-bit scheme, because the data are normalized and the Gaussian (with variance 1) density decays very rapidly at the tail. If we choose a large bin width $w$, we essentially have a 1-bit scheme (i.e., we record only the signs of the projected data), because the analysis shows that such a choice is not essentially different from using the signs alone. Note that the 1-bit scheme [?, ?] is also known as “sim-hash” in the literature.

In this paper, we will show, through analysis and experiments, that often a 2-bit scheme (i.e., uniform quantization with a suitably chosen bin width $w$) is better for LSH (depending on the data similarity). Moreover, we have developed nonlinear estimators for the 2-bit scheme which significantly improve on the estimator for the 1-bit scheme as well as on the linear estimator for the 2-bit scheme.

In this section, we compare the two coding schemes reviewed above, (i) the scheme based on windows plus a random offset, i.e., (6), and (ii) the scheme based on simple uniform quantization, i.e., (7), in the setting of approximate near neighbor search. We will show that (7) is more effective and that, in fact, only a small number of bits are needed.

Consider a data vector $u$. Suppose there exists another vector whose Euclidean distance from $u$ is at most $r$ (the target distance). The goal of $c$-approximate $r$-near neighbor algorithms is to return data vectors (with high probability) whose Euclidean distances from $u$ are at most $c r$ with $c > 1$.

Recall that, in our definition, $d$ is the squared Euclidean distance. To be consistent with the convention in [?], we present the results in terms of $r = \sqrt{d}$. Corresponding to the target distance $r$, the target similarity can be computed from $d = 2(1-\rho)$, i.e., $\rho = 1 - r^2/2$. To simplify the presentation, we focus on $\rho \ge 0$ (as is common in practice), i.e., $0 < r \le \sqrt{2}$. Once we fix a target similarity $\rho$, $r$ cannot exceed a certain value:
$$r \le \sqrt{2(1-\rho)}.$$

For example, when $\rho = 0.5$, we must have $r \le 1$.

The performance of an LSH algorithm largely depends on the difference (gap) between the two collision probabilities $P^{(1)}$ and $P^{(2)}$ (corresponding to the target distance $r$ and the larger distance $cr$, respectively). For the scheme (7), $P_w^{(1)}$ and $P_w^{(2)}$ denote the collision probabilities of $h_w$ at distances $r$ and $cr$;

the probabilities $P_{w,q}^{(1)}$ and $P_{w,q}^{(2)}$ are analogously defined for the scheme (6).

A larger difference between $P^{(1)}$ and $P^{(2)}$ implies a more efficient LSH algorithm. The following gap values ($G_w$ for $h_w$ and $G_{w,q}$ for $h_{w,q}$, respectively) characterize the gaps:

$$G_w = \frac{\log\left(1/P_w^{(1)}\right)}{\log\left(1/P_w^{(2)}\right)}, \qquad G_{w,q} = \frac{\log\left(1/P_{w,q}^{(1)}\right)}{\log\left(1/P_{w,q}^{(2)}\right)}. \qquad (8)$$

A smaller gap (i.e., a larger difference between $P^{(1)}$ and $P^{(2)}$) leads to a potentially more efficient LSH algorithm and is particularly desirable [?]. The general theory of LSH says the query time for $c$-approximate $r$-near neighbor search is dominated by $O(n^{G})$ distance evaluations, where $n$ is the total number of data vectors in the collection. This is better than $O(n)$, the cost of a linear scan.
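To illustrate how these quantities can be evaluated, the self-contained sketch below estimates the two collision probabilities and the resulting gap for scheme (7) by Monte Carlo, for given $r$, $c$, and $w$. The paper works with exact expressions; the simulation is only meant to convey the same comparison, and the parameter values in the example call are illustrative.

```python
import numpy as np

def collision_prob(rho, w, n=200_000, seed=2):
    """Monte Carlo estimate of Pr[floor(x/w) == floor(y/w)] where
    (x, y) is bivariate normal with unit variances and correlation rho."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    z = rng.standard_normal(n)
    y = rho * x + np.sqrt(1.0 - rho**2) * z
    return np.mean(np.floor(x / w) == np.floor(y / w))

def gap(r, c, w):
    """Gap G_w = log(1/P1) / log(1/P2) for target distance r and ratio c,
    using rho = 1 - r^2/2 for unit-norm data."""
    p1 = collision_prob(1.0 - r**2 / 2.0, w)
    p2 = collision_prob(1.0 - (c * r)**2 / 2.0, w)
    return np.log(1.0 / p1) / np.log(1.0 / p2)

print(gap(r=0.5, c=2.0, w=0.75))
```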

The next figure compares $G_w$ with $G_{w,q}$ at their “optimum” $w$ values, as functions of $c$, for a wide range of target similarity levels. Basically, at each $c$ and target similarity, we choose the $w$ that minimizes $G_w$ and the $w$ that minimizes $G_{w,q}$. This figure illustrates that $G_w$ is smaller than $G_{w,q}$, noticeably so in the low similarity region.

Figure: Comparison of the optimum gaps (smaller is better) for $h_w$ and $h_{w,q}$. For each $c$ and target similarity level, we find the smallest gaps individually for $h_w$ and $h_{w,q}$ over the entire range of $w$. For all target similarity levels, $h_w$ always has a smaller gap than $h_{w,q}$, although in the high similarity region both perform similarly.

The next figures present $G_w$ and $G_{w,q}$ as functions of $w$, for two representative settings of $c$. In each figure, we plot the curves for a wide range of target similarity values. These figures illustrate where the optimum $w$ values are obtained. Clearly, in the high similarity region, the smallest gap values are obtained at low $w$ values, especially for small $c$. In the low (or moderate) similarity region, the smallest gap values are usually attained at relatively large $w$.

Figure: The gaps $G_w$ and $G_{w,q}$ as functions of $w$ (first setting of $c$). The lowest points on the curves are reflected in the optimum-gap comparison figure above.

Figure: The gaps $G_w$ and $G_{w,q}$ as functions of $w$ (second setting of $c$).

Figure: The gaps $G_w$ and $G_{w,q}$ as functions of the target similarity (first setting of $c$). In each panel, we plot $G_w$ and $G_{w,q}$ for one value of $w$.

Figure: The gaps $G_w$ and $G_{w,q}$ as functions of the target similarity (second setting of $c$). In each panel, we plot $G_w$ and $G_{w,q}$ for one value of $w$.

Figure: Upper panels: the optimal (smallest) gaps at given $c$ values over the entire range of target similarities. We can see that the optimal $G_{w,q}$ is always larger than the optimal $G_w$, confirming that it is better to use $h_w$ instead of $h_{w,q}$. Bottom panels: the optimal values of $w$ at which the optimal gaps are attained. When the target similarity is very high, it is best to use a relatively small $w$.

In practice, we normally have to pre-specify the bin width $w$ for all similarity levels and $c$ values. In other words, the “optimum” gap values presented in the comparison figure are in general not attainable. Thus, the corresponding figures above present $G_w$ and $G_{w,q}$ as functions of the target similarity, for fixed choices of $w$. In each figure, we plot the curves for a wide range of $w$ values. These figures again confirm that $G_w$ is smaller than $G_{w,q}$, i.e., the scheme (7) without the offset is better.

To view the optimal gaps more closely, the two-row figure above plots the best gaps (upper panels) and the optimal $w$ values (bottom panels) at which the best gaps are attained, for selected values of $c$. These plots again confirm the previous comparisons:

  • We should always replace $h_{w,q}$ with $h_w$. At any $c$ and similarity level, the optimal gap $G_{w,q}$ is at least as large as the optimal gap $G_w$. At relatively low similarities, the optimal $G_{w,q}$ can be substantially larger than the optimal $G_w$.

  • If we use $h_w$ and target very high similarity, a relatively small bin width $w$ is a reasonable choice.

  • If we use $h_w$ and the target similarity is not too high, then we can safely use a somewhat larger bin width $w$.

We should also mention that, although the optimal values of $w$ appear to exhibit a “jump” in the right panels of the figure above, the choice of $w$ does not influence the performance much, as shown in the previous plots. In the earlier gap figures, we have seen that even when the optimal $w$ appears to approach an extreme value, the actual gaps are not much different over a range of $w$. In the real data evaluations in the next section, we will see the same phenomenon.

Note that the Gaussian density decays very rapidly at the tail; the probability that a projected value falls more than a few bin widths away from zero is negligible. If we choose $w$ accordingly, then we practically need (at most) 2 bits to code each hashed value; that is, we can simply quantize the data according to the 2-bit scheme shown earlier (see the 2-bit coding figure).

In the process of using hash tables for sublinear time near neighbor search, there is an important step called “re-ranking”. With a good LSH scheme, the fraction of retrieved data points can be relatively low, but the absolute number of retrieved points can still be very large (e.g., even a small fraction of a billion points is still a large number). It is thus crucial to have a re-ranking mechanism, for which one has to either compute or estimate the actual similarities.

When the original data are massive and high-dimensional, i.e., a data matrix in $\mathbb{R}^{n \times D}$ with both $n$ and $D$ being large, it can be challenging to evaluate the similarities. For example, it is often not possible to load the entire dataset into memory. In general, we cannot store all pairwise similarities, which would require $O(n^2)$ space and is not practical even for moderately large $n$. In addition, the query might be a new data point, so we would have to compute the similarities on the fly anyway. If the data are high-dimensional, the computation of the exact similarities itself can be too time-consuming.

A feasible solution is to estimate the similarities on the fly for re-ranking, from a small amount of projected data stored in memory. This has motivated us to develop nonlinear estimators for a 2-bit coding scheme, by exploiting the full information in the bits.

There are other applications of nonlinear estimators too. For example, we can use random projections and nonlinear estimators for computing nonlinear kernels for SVM. Another example is to find nearest neighbors by random projections (to reduce the dimensionality and data size) and brute-force linear scan of the projected data, which is simple to implement and easy to run in parallel.

Two-stage coding.   Note that the coding scheme for building hash tables can be separate from the coding scheme for developing accurate estimators. Once we have projected the data and placed the points into the buckets using a designated coding scheme, we can actually discard those codes. In other words, we can code the same projected data twice; the second time, we store the codes of (a fraction of) the projected data for the task of similarity estimation.

We conduct a set of experiments on LSH and re-ranking to demonstrate the advantage of the proposed nonlinear estimator for the 2-bit coding scheme. Again, we adopt the standard $(K, L)$-LSH scheme [?]. That is, we concatenate $K$ independent hash functions to build each hash table, and we independently build $L$ such hash tables. Note that here we use the capital letter $K$ to differentiate it from $k$, which we use for the sample size (number of projections) in the context of similarity estimation.

We have shown that, for building hash tables, it is good to use uniform quantization with a smaller bin width if the target similarity is high and a larger bin width if the target similarity is not so high. We use a separate symbol for the bin width used for building hash tables, to distinguish it from the bin width $w$ used for similarity estimation; for simplicity, both are fixed throughout. We also fix $K$ and $L$. The results (especially the trends) we present are not too sensitive to these parameter choices.

Once we have built the hash tables, we need to store a fraction of the coded projected data. To save space, we should store only a small number $k$ of projections per data point. Here we choose two values of $k$, which appear to be sufficient to provide accurate estimates of the similarity for re-ranking the retrieved data points.

We target the top-$T$ nearest neighbors, for several values of $T$ (10, 20, 50, 100, as in the figures). We re-rank the retrieved points according to estimated similarities based on 3 different estimators: (i) the nonlinear MLE for 2-bit coding as studied in this paper; (ii) the 2-bit linear estimator; (iii) the 1-bit estimator. We present the results in terms of precision-recall curves (higher is better) for retrieving the top-$T$ points. That is, we first rank all retrieved points according to estimated similarities. Then, for a particular cutoff, we examine the top of the list to compute one (precision, recall) tuple. By varying the cutoff, we obtain a precision-recall curve for each $T$, averaged over all query points.
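As an illustration of how such curves can be computed (this is our sketch, not the paper's evaluation code), the function below produces one precision and recall value at every cutoff of a re-ranked candidate list; averaging over queries gives the curves shown in the figures.

```python
import numpy as np

def precision_recall(est_sims, candidate_ids, true_top_ids):
    """Precision and recall at every cutoff of the re-ranked candidates.

    est_sims      : estimated similarities of the retrieved candidates
    candidate_ids : ids of the retrieved candidates (same order as est_sims)
    true_top_ids  : ids of the true top-T nearest neighbors of the query
    """
    order = np.argsort(-np.asarray(est_sims))            # rank by estimated similarity
    ranked = np.asarray(candidate_ids)[order]
    truth = set(true_top_ids)
    hits = np.cumsum([cid in truth for cid in ranked])   # running count of true neighbors found
    cutoffs = np.arange(1, len(ranked) + 1)
    return hits / cutoffs, hits / len(truth)              # precision, recall
```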

As shown in the three figures below, in all our experiments, we see that the 2-bit MLE substantially improves the 2-bit linear estimator, which in turn substantially improves the 1-bit estimator.





Figure: Youtube: precision-recall curves (higher is better) for retrieving the top-10, -20, -50, -100 nearest neighbors using the standard $(K, L)$-LSH scheme and 3 different estimators of similarities (for the retrieved data points). The Youtube dataset is a subset of the publicly available UCL-Youtube-Vision dataset. We use 97,934 data points for building hash tables and 5,000 data points as queries. The results are averaged over all the query points. In the LSH experiments, we fix $L$ and use one setting of $K$ (upper two layers) and another (bottom two layers). We estimate the similarities using two different sample sizes $k$. We can see that, for any combination of parameters, the nonlinear MLE (labeled “MLE”) always substantially improves the 2-bit linear estimator (labeled “2-bit”), which substantially improves the 1-bit estimator (labeled “1-bit”).

Figure: Peekaboom: precision-recall curves (higher is better) for retrieving the top-10, -20, -50, -100 nearest neighbors using the standard $(K, L)$-LSH scheme and 3 different estimators of similarities (for the retrieved data points). Peekaboom is a standard image retrieval dataset with 20,019 data points for building the tables and 2,000 data points as queries.

Figure: LabelMe: precision-recall curves (higher is better) for retrieving the top-10, -20, -50, -100 nearest neighbors using the standard $(K, L)$-LSH scheme and 3 different estimators of similarities (for the retrieved data points). LabelMe is a standard image retrieval dataset with 55,599 data points for building the tables and 1,998 data points as queries.

The method of random projections is a standard tool for many data processing applications which involve massive, high-dimensional datasets (common in Web search and data mining). In the context of approximate near neighbor search by building hash tables, it is mandatory to quantize (code) the projected data into integers. Prior to this work, there were two popular coding schemes: (i) an “infinite-bit” scheme [?] using uniform quantization with a random offset; and (ii) a “1-bit” scheme [?, ?] using the signs of the projected data. This paper bridges these two strategies.

In this paper, we show that, for the purpose of building hash tables in the framework of LSH, using uniform quantization without the offset leads to an improvement over the prior work [?]. Our method only needs a small number of bits for coding each hashed value. Roughly speaking, when the target similarity is high (which is often the case of interest in practice), it is better to use 2 or 3 bits. But if the target similarity is not so high, 1 or 2 bits often suffice. Overall, we recommend the use of a 2-bit scheme for LSH. Not surprisingly, as an additional benefit, using a 2-bit scheme typically halves the preprocessing cost compared to using the 1-bit scheme.

For approximate near neighbor search, an important (and sometimes less well-discussed) step is the “re-ranking”, which is needed in order to identify the truly similar data points among the large number of candidates retrieved from the hash tables. This re-ranking step requires a good estimator of the similarity, because pre-computing and storing all pairwise similarities is normally not feasible, and computing the exact similarities on the fly can be time-consuming, especially for high-dimensional data. In this paper, we propose the use of nonlinear estimators and we analyze the 2-bit case in detail. Although the analysis appears sophisticated, the estimation procedure is computationally feasible and simple, for example, by tabulation. Compared to the standard 1-bit and 2-bit linear estimators, the proposed nonlinear estimator significantly improves the accuracy, both theoretically and empirically.

In summary, our paper advances the state-of-the-art of random projections in the context of approximate near neighbor search.

  • [1] Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all pairs similarity search. In WWW, pages 131–140, 2007.
  • [2] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In KDD, pages 245–250, San Francisco, CA, 2001.
  • [3] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In WWW, pages 1157 – 1166, Santa Clara, CA, 1997.
  • [4] Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225–242, 2002.
  • [5] Michael A Casey and Malcolm Slaney. Song intersection by approximate nearest neighbor search. In ISMIR, volume 6, pages 144–149, 2006.
  • [6] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.
  • [7] Sanjoy Dasgupta. Learning mixtures of gaussians. In FOCS, pages 634–644, New York, 1999.
  • [8] Sanjoy Dasgupta. Experiments with random projection. In UAI, pages 143–151, Stanford, CA, 2000.
  • [9] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262, Brooklyn, NY, 2004.
  • [10] Songyun Duan, Achille Fokoue, Oktie Hassanzadeh, Anastasios Kementsietsidis, Kavitha Srinivas, and Michael J. Ward. Instance-based matching of large ontologies using locality-sensitive hashing. In Proceedings of the 11th International Conference on The Semantic Web - Volume Part I, pages 49–64, 2012.
  • [11] Dmitriy Fradkin and David Madigan. Experiments with random projections for machine learning. In KDD, pages 517–522, Washington, DC, 2003.
  • [12] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.
  • [13] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115–1145, 1995.
  • [14] Hannaneh Hajishirzi, Wen-tau Yih, and Aleksander Kolcz. Adaptive near-duplicate detection via similarity learning. In SIGIR, pages 419–426, 2010.
  • [15] Monika Rauch Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284–291, 2006.
  • [16] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.
  • [17] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mapping into Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
  • [18] Weihao Kong, Wu-Jun Li, and Minyi Guo. Manhattan hashing for large-scale image retrieval. In SIGIR, pages 45–54, 2012.
  • [19] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, pages 2130–2137, 2009.
  • [20] Cong Leng, Jian Cheng, and Hanqing Lu. Random subspace for binary codes learning in large scale image retrieval. In SIGIR, pages 1031–1034, 2014.
  • [21] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In COLT, pages 635–649, Pittsburgh, PA, 2006.
  • [22] Ping Li and Arnd Christian König. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010.
  • [23] Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava. Coding for random projections. In ICML, 2014.
  • [24] Michael Mitzenmacher, Rasmus Pagh, and Ninh Pham. Efficient estimation for high similarities using odd sketches. In WWW, pages 109–118, 2014.
  • [25] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In PODS, pages 159–168, Seattle,WA, 1998.
  • [26] Santosh Vempala. The Random Projection Method. American Mathematical Society, Providence, RI, 2004.
  • [27] Fei Wang and Ping Li. Efficient nonnegative matrix factorization with random projections. In SDM, 2010.