Efficient Sketching Algorithm for Sparse Binary Data

1st Rameshwar Pratap
School of Computing and Electrical Engineering
IIT Mandi, H.P., India
rameshwar@iitmandi.ac.in

2nd Debajyoti Bera
Department of Computer Science
IIIT Delhi, India
dbera@iiitd.ac.in

3rd Karthik Revanuru
NALT Analytics
Bangalore, India
karthikrvnr@gmail.com
Abstract

Recent advancements in the WWW, IoT, social networks, e-commerce, etc. have generated large volumes of data. These datasets are mostly high dimensional and sparse. Many fundamental subroutines of common data analytic tasks such as clustering, classification, ranking, and nearest neighbour search scale poorly with the dimension of the dataset. In this work, we address this problem and propose a sketching (alternatively, dimensionality reduction) algorithm – BinSketch (Binary Data Sketch) – for sparse binary datasets. BinSketch preserves the binary nature of the dataset after sketching and maintains estimates for multiple similarity measures such as Jaccard, Cosine, and Inner-Product similarities, and Hamming distance, on the same sketch. We present a theoretical analysis of our algorithm and complement it with extensive experimentation on several real-world datasets. We compare the performance of our algorithm with the state-of-the-art algorithms on the tasks of mean-square-error estimation and ranking. Our proposed algorithm offers comparable accuracy while providing a significant speedup in dimensionality reduction time with respect to the other candidate algorithms. Our proposal is simple and easy to implement, and therefore can be adopted in practice. ¹A preliminary version of this paper has been accepted at IEEE ICDM, 2019.

I Introduction

Due to technological advancements, recent years have witnessed a dramatic increase in our ability to collect data from various sources such as the WWW, IoT, social media platforms, mobile applications, finance, and biology. For example, in many web applications, the volume of the datasets is of the terascale order, with trillions of features [1]. The high dimensionality incurs high memory requirements and computational cost during training. Further, most such high-dimensional datasets are sparse, owing to the wide adoption of “Bag-of-Words” (BoW) representations. For example, in the case of document representation, word frequency within a document follows a power law – most words occur rarely in a document, and higher-order shingles occur only once. We focus on the binary representation of datasets, which is quite common in several applications [26, 17].

Measuring the similarity score of data points under various similarity measures is a fundamental subroutine in several applications such as clustering, classification, nearest neighbour search, and ranking, and it plays an important role in various data mining, machine learning, and information retrieval tasks. However, due to the “curse of dimensionality”, a brute-force computation of similarity scores in a high-dimensional dataset is infeasible, and at times impossible. In this work, we address this question and propose an efficient dimensionality reduction algorithm for sparse binary datasets that generates a succinct sketch of the dataset while preserving estimates of the similarity scores between data objects.

I-A Our Contribution

We first informally describe our sketching algorithm.

BinSketch (Binary Data Sketching): Given a high-dimensional binary vector, our algorithm reduces it to an N-dimensional binary vector, where N is specified later. It randomly maps each bit position of the input to one of the N positions of the sketch. To compute the i-th bit of the sketch, it checks which bit positions have been mapped to i, computes the OR of the bits located at those positions, and assigns the result to the i-th bit of the sketch.
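The construction can be illustrated in a few lines. The following is a minimal sketch of the procedure just described, with made-up dimensions, mapping, and variable names; it is not the authors' reference implementation.

```python
# A toy illustration of the sketching step described above: every input bit
# position is mapped to one of N sketch positions, and each sketch bit is the
# OR of the input bits mapped to it.
import random

random.seed(0)
d, N = 20, 5                                        # illustrative sizes
mapping = [random.randrange(N) for _ in range(d)]   # bit position j -> mapping[j]

u = [0] * d
for j in (1, 4, 9, 17):                             # a sparse binary vector
    u[j] = 1

sketch = [0] * N
for j, bit in enumerate(u):
    if bit:
        sketch[mapping[j]] = 1                      # OR of the bits mapped here

print(sketch)
```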

A simple and exact solution to the problem is to represent each binary vector by a (sorted) list of the indices with value one. In this representation, the space required to store a vector is ψ log d bits – we need log d bits for storing each index, and there are at most ψ indices with non-zero value, where ψ denotes the sparsity and d the dimension. Further, the time complexity of computing, say, the inner product of two ψ-sparse binary vectors is O(ψ log d). Therefore, both the storage and the time complexity of calculating similarity depend on the original dimension d and do not scale for large values of d. For high-dimensional sparse binary data, we show how to construct highly compressed binary sketches whose length depends only on the data sparsity. Furthermore, we present techniques to compute the similarity between vectors from their sketches alone. Our main technique is presented in Algorithm 1 for inner product similarity, and the following theorem summarizes it.

Theorem 1 (Estimation of Inner product).

Suppose we want to estimate the Inner Product of d-dimensional binary vectors, whose sparsity is at most ψ, with probability at least 1 − δ. We can use BinSketch to construct N-dimensional binary sketches, for an N that depends only on ψ and δ. If u' and v' denote the sketches of vectors u and v, respectively, then ⟨u, v⟩ can be estimated from them using Algorithm 1, with the accuracy guarantee established in Section III.

We also present Algorithm 2 for estimating Hamming distance, Algorithm 3 for estimating Jaccard similarity, and Algorithm 4 for estimating Cosine similarity; all of these are built on Algorithm 1 and hence enjoy similar accuracy guarantees.

Extension for categorical data compression. Our result can be easily extended to compressing categorical datasets. A categorical dataset consists of several categorical features; examples of categorical features are sex, weather, day of the week, age group, educational level, etc. We consider a Hamming-type distance between two categorical data points: for two d-dimensional categorical data points x and y, the distance between them is defined as the number of features on which they disagree, i.e., dist(x, y) = Σ_i 1[x_i ≠ y_i].

In order to use BinSketch, we need to preprocess the dataset. We first encode each categorical feature via label-encoding followed by one-hot-encoding. In the label-encoding step, feature values are encoded as integers: for a given feature with k possible values, we encode them with integers between 0 and k − 1. In the one-hot-encoding step, we convert each feature value into a length-k binary string in which a single 1 is located at the position corresponding to the result of the label-encoding step.² This preprocessing converts a categorical dataset into a binary dataset. Note that, after preprocessing, the Hamming distance between the binary versions of two data points is exactly twice the categorical distance stated above, and hence proportional to it. We can now compress the binary version of the dataset using BinSketch and, due to Algorithm 2, the compressed representation maintains an estimate of the Hamming distance. ²Both label-encoding and one-hot-encoding are available in sklearn as the LabelEncoder and OneHotEncoder classes.
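As a concrete illustration, the preprocessing can be carried out with scikit-learn; the feature names and values below are made up, and OneHotEncoder performs the label-encoding and one-hot-encoding steps together.

```python
# A minimal sketch of the categorical-to-binary preprocessing described above;
# the features and values are illustrative.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([
    ["male",   "sunny", "Mon"],
    ["female", "rainy", "Tue"],
    ["female", "sunny", "Mon"],
])

# Each feature with k observed values becomes a length-k binary block.
encoder = OneHotEncoder()
B = encoder.fit_transform(X).toarray().astype(np.uint8)

# The Hamming distance between two encoded rows is twice the number of
# categorical features on which the original rows disagree.
print(B.shape, int(np.sum(B[0] != B[1])))
```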

In Section III we present the proof of Theorem 1, where we explain the theoretical reasons behind the effectiveness of BinSketch. As is usually the case for hash functions, practical performance often outshines the theoretical bounds, so we conduct numerous experiments on public datasets. Based on our experimental results reported in Section IV, we claim that BinSketch is an excellent option for compressing sparse binary vectors while retaining similarity under many commonly used measures. The accuracy obtained is comparable with the state-of-the-art sketching algorithms, especially in high-similarity regions, while the compression takes almost negligible time compared to the sketching algorithms proposed so far.

I-B Related work

Our proposed algorithm is very similar in nature to the BCS algorithm [21, 22], which is a randomized bucketing algorithm where each index of the input is randomly assigned to one of the buckets (here ψ denotes the sparsity of the dataset). The sketch of an input vector is obtained by computing the parity of the bits falling in each bucket. We offer a better compression bound than theirs: for a pair of vectors, our required compression length is smaller. This is also reflected in our empirical evaluations – for small values of the compression length, BinSketch outperforms BCS. However, the compression times (or dimensionality reduction times) of the two algorithms are comparable.

For Jaccard similarity, we compare the performance of our algorithm with MinHash [3], Densified One Permutation Hashing (DOPH) [25] – a faster variant of MinHash – and Odd Sketch [20]. We would like to point out some key differences between BinSketch and Odd Sketch. Odd Sketch is two-step in nature: it takes the sketch obtained by running MinHash on the original data as input and outputs a binary sketch that maintains an estimate of the original Jaccard similarity. Due to this two-step nature, its compression time is higher (see Table I and Figure 3). The number of MinHash functions used in Odd Sketch is a crucial parameter; the authors suggested choosing it so that the pairwise symmetric difference of the MinHash sketches remains suitably bounded, and empirically they suggest setting it as a function of the similarity threshold. We argue that not only is this tuning an important step, it is also unclear how the condition can be satisfied for a diverse dataset; in contrast, BinSketch requires no such parameter. Furthermore, Odd Sketch does not provide a closed-form expression relating accuracy and confidence, and the variance of the critical term of their estimator is linear in the size of the sketch, whereas our confidence interval can be far smaller, even for non-sparse data. Finally, compared to the Poisson-approximation-based analysis used for Odd Sketch, we employ a tighter martingale-based analysis leading to (slightly) better concentration bounds (compare, e.g., the concentration bounds for estimating the size of a set from its sketch).

For Cosine similarity, we compare with SimHash [9], Circulant Binary Embedding (CBE) [27] – a faster variant of SimHash – MinHash [24], and a variant that uses DOPH [25] in the algorithm of [24] instead of MinHash. For the Inner Product, BCS [22], Asymmetric MinHash [24], and Asymmetric DOPH – using DOPH [25] in [24] – were the competing algorithms. For all these similarity measures, on sparse binary datasets our proposed algorithm is faster while simultaneously offering performance close to that of the baselines. We experimentally compared the algorithms on several real-world datasets and observed results in line with these claims. Further, in order to obtain a sketch of size N, our algorithm requires fewer random bits and only one pass over the dataset; these are the major reasons for the speedup in compression time. We summarize this comparison in Table I. Finally, a major advantage of our algorithm, shared with [21, 22], is that it gives one-shot sketching: estimates of multiple similarity measures are maintained in the same sketch, in contrast to the usual sketches that are customized for a specific similarity measure.

Algorithm | No. of random bits | Compression time
BCS [21, 22]
DOPH [25]
CBE [27]
Odd Sketch [20]
SimHash [9]
MinHash [3]
TABLE I: A comparison among the candidate algorithms of the number of random bits and the compression time required to obtain a sketch of length N for a single data object. Compression time includes both (i) the time required to generate the hash function, which is of the order of the number of random bits, and (ii) the time required to generate the sketch using the hash function. The parameter appearing for Odd Sketch denotes the number of permutations required by its intermediate MinHash step.

Connection with Bloom Filter

BinSketch appears structurally similar to a Bloom filter with one hash function. The standard Bloom filter is a space-efficient data structure for set-membership queries; it can also be used to estimate the intersection between two sets [4]. However, it is unclear how estimates for other similarity measures can be obtained from it. We answer this question positively and provide estimates for all four similarity measures from the same sketch. We also show that our estimates are strongly concentrated around their expected values.

I-C Applicability of our results

For high-dimensional sparse binary datasets, due to its simplicity, efficiency, and performance, BinSketch can be used in numerous applications that require a sketch preserving Jaccard similarity, Cosine similarity, Hamming distance, or Inner Product.

Scalable Ranking and deduplication of documents

Given a corpus of documents and a set of query documents, the goal is to find all documents in the corpus that are “similar” to the query documents under a given similarity measure (e.g., Jaccard, Cosine, Inner Product). This problem is a fundamental subroutine in many applications like near-duplicate data detection [19, 14, 5], efficient document similarity search [16, 24], and plagiarism detection [7, 5], and dimensionality reduction is one way to address it. In Subsection IV-B we provide empirical validation that BinSketch offers a significant speed-up in dimensionality reduction while offering comparable accuracy.

Scalable Clustering of documents

BinSketch can be used to scale up the performance of several clustering algorithms on high-dimensional and sparse datasets. For instance, for Spherical k-means clustering, which is the problem of clustering data points using Cosine similarity, one can use [11], and for k-mode clustering, which is clustering using Hamming distance, one can use k-modes [15], on the sketch obtained by BinSketch.

Other Applications

Beyond the applications noted above, sketching techniques have been used widely in applications such as spam detection [6], compressing social networks [10], all-pairs similarity search [2], and frequent itemset mining [8]. As BinSketch offers a significant speed-up in dimensionality reduction time while simultaneously providing a succinct and accurate sketch, it can help scale up the performance of the respective algorithms.

II Background

Notations
N : dimension of the compressed data.
ψ : sparsity bound.
u[i] : i-th bit of the binary vector u.
|u| : number of 1's in the binary vector u.
Cos(u, v) : Cosine similarity between u and v.
JS(u, v) : Jaccard similarity between u and v.
HD(u, v) : Hamming distance between u and v.
⟨u, v⟩ : Inner product between u and v.

SimHash for Cosine similarity [9, 12].

The Cosine similarity between a pair of vectors u and v is defined as Cos(u, v) = ⟨u, v⟩ / (‖u‖ · ‖v‖). To compute a 1-bit sketch of a vector u, SimHash [9] generates a random vector r, with each component chosen uniformly at random from {−1, +1}, and sets the sketch bit to 1 if ⟨r, u⟩ ≥ 0 and to 0 otherwise.

SimHash was shown to preserve the angle between vectors (and hence the Cosine similarity) in the following manner [12]. Let θ be the angle such that cos θ = Cos(u, v). Then, Pr[the sketch bits of u and v agree] = 1 − θ/π.
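For concreteness, the following is a minimal sketch of SimHash for binary vectors, assuming random vectors with entries drawn uniformly from {−1, +1}; the parameters and names are illustrative.

```python
# An illustrative SimHash sketch and the corresponding cosine estimator.
import numpy as np

def simhash(u, R):
    """One bit per random vector (row of R): 1 if <r, u> >= 0, else 0."""
    return (R @ u >= 0).astype(np.uint8)

def estimated_cosine(su, sv):
    """Estimate Cos(u, v) from the fraction of disagreeing bits (~ theta / pi)."""
    theta = np.pi * np.mean(su != sv)
    return np.cos(theta)

rng = np.random.default_rng(0)
d, N = 10_000, 256
R = rng.choice([-1, 1], size=(N, d))
u = (rng.random(d) < 0.005).astype(np.uint8)
v = u.copy()
v[rng.choice(d, 20, replace=False)] ^= 1            # a slightly perturbed copy of u
print(round(float(estimated_cosine(simhash(u, R), simhash(v, R))), 3))
```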

MinHash for Jaccard and Cosine similarity.

The Jaccard similarity between a pair of sets A and B is defined as JS(A, B) = |A ∩ B| / |A ∪ B|. Broder et al. [3] suggested an algorithm – MinHash – to compress a collection of sets while preserving the Jaccard similarity between any pair of sets. Their technique consists of taking a random permutation of the universe and assigning to each set the element of that set which maps to the minimum value under the permutation.

Definition 2 (MinHash [3]).

Let π be a random permutation over {1, …, d}. Then, for a set A ⊆ {1, …, d}, define h_π(A) = min_{a ∈ A} π(a).

It was then shown by Broder et al. [3, 5] that Pr[h_π(A) = h_π(B)] = JS(A, B).

Exploiting the correspondence between the Jaccard similarity of sets and the Cosine similarity of binary vectors, it was shown how to use MinHash for constructing sketches for Cosine similarity in the case of sparse binary data [23].
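A minimal MinHash sketch for binary vectors, viewed as sets of their 1-positions, follows; explicit permutations are used for clarity, and all parameters are illustrative.

```python
# An illustrative MinHash sketch: one random permutation per sketch coordinate,
# each coordinate storing the minimum permuted value over the set.
import numpy as np

def minhash_sketch(ones, perms):
    """ones: indices of the 1-bits of a vector; perms: (k, d) array of permutations."""
    return perms[:, ones].min(axis=1)

def estimated_jaccard(su, sv):
    return float(np.mean(su == sv))       # Pr[minimums agree] = JS(A, B)

rng = np.random.default_rng(1)
d, k = 10_000, 128
perms = np.array([rng.permutation(d) for _ in range(k)])
A = rng.choice(d, 50, replace=False)
B = np.concatenate([A[:40], rng.choice(d, 10, replace=False)])
print(round(estimated_jaccard(minhash_sketch(A, perms), minhash_sketch(B, perms)), 3))
```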

BCS for sparse binary data [22, 21].

For sparse binary datasets, BCS offers a sketching algorithm that simultaneously preserves Jaccard similarity, Hamming distance, and Inner Product.

Definition 3 (BCS).

Let N be the number of buckets. Choose a random mapping π from {1, …, d} to {1, …, N}. Then a vector u ∈ {0, 1}^d is compressed into a vector ū ∈ {0, 1}^N as follows: the i-th bit of ū is the parity (XOR) of the bits of u whose positions are mapped to the i-th bucket.
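For later comparison with BinSketch, here is an illustrative rendering of Definition 3 (parity per bucket); it is a reconstruction from the description above, not the authors' reference code.

```python
# An illustrative BCS sketch: XOR (parity) of the input bits falling in each bucket.
import numpy as np

def bcs_sketch(u, mapping, N):
    sketch = np.zeros(N, dtype=np.uint8)
    for j in np.flatnonzero(u):          # only 1-bits can flip a bucket's parity
        sketch[mapping[j]] ^= 1
    return sketch

rng = np.random.default_rng(2)
d, N = 10_000, 200
mapping = rng.integers(0, N, size=d)     # random map from positions to buckets
u = (rng.random(d) < 0.005).astype(np.uint8)
print(int(bcs_sketch(u, mapping, N).sum()))
```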

III Analysis of BinSketch

Let u and v denote two binary vectors in d dimensions, and let |u|, |v| denote the number of 1's in u and v, respectively. Let u', v' denote the compressed representations of u and v, where N denotes the compression length (or reduced dimension). In this section we explain our sketching method and give theoretical bounds on its efficacy.

Definition 4 (BinSketch).

Let π be a random mapping from {1, …, d} to {1, …, N}. Then a vector u ∈ {0, 1}^d is compressed into a vector u' ∈ {0, 1}^N as follows: u'[i] is the OR of the bits u[j] over all positions j with π(j) = i.

Constructing a BinSketch for a dataset involves, first, generating a random mapping π and, second, hashing each vector in the dataset using π. There are N^d possible mappings, so choosing π requires O(d log N) time and that many random bits. Hashing a vector u involves only looking at its non-zero bits, and that step takes O(ψ) time since |u| ≤ ψ. Both these costs compare favorably with the existing algorithms tabulated in Table I.

III-A Inner-product similarity

The sketches u' do not quite “preserve” the inner product by themselves, but are related to it in the following sense. We will use ρ to denote 1 − 1/N; it will be helpful to note that ρ^k decreases as k increases.

Lemma 5.

E[|u'|] = N(1 − ρ^{|u|}) and E[⟨u', v'⟩] = N(1 − ρ^{|u|} − ρ^{|v|} + ρ^{|u| + |v| − ⟨u, v⟩}).

Proof.

It will be easier to identify u with the subset S_u ⊆ {1, …, d} of its 1-positions. The i-th bit of u' can be set only by some element of S_u, which happens with probability 1 − ρ^{|u|}. The i-th bit of both u' and v' is set if it is set by some element in S_u ∩ S_v, or if it is set simultaneously by some element in S_u \ S_v and by another element in S_v \ S_u. This translates to the following probability that some particular bit is set in both u' and v': 1 − ρ^{|u|} − ρ^{|v|} + ρ^{|u| + |v| − ⟨u, v⟩}.

The lemma follows from the above probabilities using the linearity of expectation. ∎

Note that the above lemma allows us to express the inner product as
⟨u, v⟩ = |u| + |v| − log_ρ( ρ^{|u|} + ρ^{|v|} − 1 + E[⟨u', v'⟩]/N ).

Algorithm 1 now explains how to use this relation to approximately calculate ⟨u, v⟩ from the sketches u' and v'.

Input: Sketches u' of u and v' of v

1: Estimate |u| as n_u = log_ρ(1 − |u'|/N) and |v| as n_v = log_ρ(1 − |v'|/N)
2: Estimate E[⟨u', v'⟩] as ⟨u', v'⟩
3: Approximate ρ^{|u|} as 1 − |u'|/N and ρ^{|v|} as 1 − |v'|/N
4: return the approximation of ⟨u, v⟩ as n_u + n_v − log_ρ( ⟨u', v'⟩/N − 1 + (1 − |u'|/N) + (1 − |v'|/N) )
Algorithm 1: BinSketch estimation of ⟨u, v⟩
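The following sketch implements BinSketch together with the inner-product estimator, following the reconstruction of Algorithm 1 given above (with ρ = 1 − 1/N); it is an illustrative implementation under those assumptions, not the authors' reference code.

```python
# An illustrative implementation of BinSketch and of Algorithm 1, assuming the
# expectation formula of Lemma 5 as stated above.
import numpy as np

def binsketch(u, mapping, N):
    """The i-th sketch bit is the OR of the bits of u whose positions map to i."""
    sketch = np.zeros(N, dtype=np.uint8)
    sketch[mapping[np.flatnonzero(u)]] = 1
    return sketch

def estimate_inner_product(su, sv, N):
    log_rho = np.log(1.0 - 1.0 / N)
    a_u, a_v = su.sum(), sv.sum()                     # number of 1s in the sketches
    n_u = np.log(1.0 - a_u / N) / log_rho             # estimate of |u|  (step 1)
    n_v = np.log(1.0 - a_v / N) / log_rho             # estimate of |v|  (step 1)
    dot = float(np.count_nonzero(su & sv))            # estimate of E[<u', v'>]  (step 2)
    # Steps 3-4: invert the expectation formula using rho^|u| ~ 1 - a_u/N, etc.
    inner = dot / N - 1.0 + (1.0 - a_u / N) + (1.0 - a_v / N)
    return n_u + n_v - np.log(inner) / log_rho

rng = np.random.default_rng(3)
d, N, psi = 100_000, 2_000, 300
mapping = rng.integers(0, N, size=d)
ones_u = rng.choice(d, psi, replace=False)
u = np.zeros(d, dtype=np.uint8); u[ones_u] = 1
v = np.zeros(d, dtype=np.uint8); v[ones_u[:200]] = 1
v[rng.choice(d, 100, replace=False)] = 1
su, sv = binsketch(u, mapping, N), binsketch(v, mapping, N)
print(int(np.count_nonzero(u & v)), round(float(estimate_inner_product(su, sv, N)), 1))
```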

We will prove that Algorithm 1 estimates ⟨u, v⟩ with high accuracy and confidence for a suitable choice of N; here δ can be set to any desired probability of error, and we assume that the sparsity ψ is not too small, say at least 20. Our first result proves that the n_u estimated above is a good approximation of |u|; an identical result holds for n_v and |v|.

Lemma 6.

With probability at least , it holds that

Proof.

The proof of this lemma is a simple adaptation of the computation of the expected number of non-empty bins in a balls-and-bins experiment, which is found in textbooks and is done using Doob’s martingale. Identify the random mapping π, restricted to the 1-positions of u (of which there are |u|), as throwing |u| black balls (and “no”-balls for the remaining positions), one by one, into N bins chosen uniformly at random. If we only consider the black balls in the bins, then u'[i] is an indicator variable for the event that the i-th bin is non-empty, and the number of non-empty bins can be shown to be concentrated around its expectation.³ Since the number of non-empty bins corresponds to |u'|, this concentration bound can be directly applied to prove the lemma. ³Using X to denote the number of non-empty bins and m the number of balls, the Azuma–Hoeffding inequality gives Pr[|X − E[X]| ≥ λ] ≤ 2e^{−λ²/(2m)} (see Probability and Computing, Mitzenmacher and Upfal, Cambridge Univ. Press).

Let denote the event in the statement of the lemma. Then,

where is used for the first inequality and the stated bound, with , is used for the second inequality. ∎
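As a quick sanity check of this balls-and-bins argument, one can simulate the process and verify that inverting the expected bin occupancy recovers |u| tightly; the parameters below are made up.

```python
# A Monte-Carlo check (with illustrative parameters) that the number of occupied
# sketch positions concentrates, so inverting its expectation estimates |u| well.
import numpy as np

rng = np.random.default_rng(4)
d, N, n_ones, trials = 50_000, 1_000, 300, 200
log_rho = np.log(1.0 - 1.0 / N)

estimates = []
for _ in range(trials):
    mapping = rng.integers(0, N, size=d)
    occupied = np.unique(mapping[:n_ones]).size        # bins hit by the n_ones "black balls"
    estimates.append(np.log(1.0 - occupied / N) / log_rho)

print(round(float(np.mean(estimates)), 1), round(float(np.std(estimates)), 1))
```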

A similar, but more involved, approach can be used to prove that ⟨u', v'⟩ is a good estimate of its expectation.

Lemma 7.

With probability at least , it holds that

Proof.

For given u and v, let us partition {1, …, d} into the parts S_u ∩ S_v (positions at which both u and v are 1), S_u \ S_v (positions at which u is 1 and v is 0), S_v \ S_u (positions at which u is 0 and v is 1), and the rest. Any random mapping π can then be treated as throwing grey balls (for S_u ∩ S_v), white balls (for S_u \ S_v), black balls (for S_v \ S_u), and “no”-balls randomly into N bins. Say that a bin is “greyish” if it either contains some grey ball or contains both a white and a black ball. The number of common 1-bits in u' and v' (that is, ⟨u', v'⟩) is then equal to the number of greyish bins. Observe that when any ball lands in some bin, the number of greyish bins either remains the same or increases by 1; therefore, the count of greyish bins satisfies a Lipschitz condition. This allows us to apply the Azuma–Hoeffding inequality as above and prove the lemma; we will also need the fact that the number of greyish bins is bounded by the number of balls thrown. ∎

The next lemma allows us to claim that our estimate of |u| is also within reasonable bounds. It should be noted that our sketches do not explicitly store the number of 1's in u, so it is necessary to compute this quantity from the sketch; furthermore, since this estimate is not used elsewhere, we do not require it to be an integer.

Lemma 8.

With probability at least , it holds that

Proof.

Based on Lemma 5 and Algorithm 1, n_u is obtained by inverting the expression for E[|u'|]. For the proof we use the upper bound given in Lemma 6, which holds with the stated probability. We need a few results before proceeding, based on a standard inequality for the exponential function.

Observation 9.

( )

Observation 10.

. Since , we get that .

Observation 11.

.

A proof of the above observation follows using simple algebra and the result of Lemma 6. We defer it to the full version of the paper. We use these observations for proving two possible cases of the lemma. We will use the notation .

case (i) : In this case and

For the R.H.S., by Lemma 6.
For the L.H.S., we can write as . Furthermore, since for reasonable values of and .
Combining the bounds above we get the inequality that we will further process below.

case (ii) : In this case and

As above, R.H.S. is at most using Lemma 6 and L.H.S. can be written as . Further using Observation 11 we get the inequality, .

For both the above cases we obtained that , i.e., . This gives us that employing the known inequality for any . Since , we get the desired upper bound (since for ) (using Observation 11). ∎

Of course, a similar result holds for n_v and |v| as well. The next lemma similarly establishes the accuracy of our estimate of ⟨u, v⟩.

Lemma 12.

With probability at least , it holds that

We get the following from Algorithm 1 and Lemma 5.

in which the relevant bounds follow from Lemma 8, its analogue for v, and Lemma 7, each holding with probability at least the stated value. The complete proof that our estimate is a good approximation of ⟨u, v⟩ is mostly an algebraic analysis of the above facts, and we defer it to the full version of the paper.

Theorem 1 is a direct consequence of Lemma 12 for reasonably large ψ and small δ.

III-B Hamming distance

The Hamming distance and the inner product of two binary vectors u and v are related as HD(u, v) = |u| + |v| − 2⟨u, v⟩.

The technique used in the earlier subsection can be used to estimate the Hamming distance in a similar manner.

Input: Sketches u' of u and v' of v

1: Calculate n_u, n_v, and the approximation of ⟨u, v⟩ as done in Algorithm 1
2: return the approximation of HD(u, v) as n_u + n_v − 2 × (approximation of ⟨u, v⟩)
Algorithm 2: BinSketch estimation of HD(u, v)
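Continuing the illustrative helpers from Algorithm 1, Algorithm 2 amounts to plugging the estimates into HD(u, v) = |u| + |v| − 2⟨u, v⟩; the sketch below assumes the estimate_inner_product function given earlier.

```python
# An illustrative rendering of Algorithm 2, reusing estimate_inner_product from above.
import numpy as np

def estimate_hamming(su, sv, N):
    log_rho = np.log(1.0 - 1.0 / N)
    n_u = np.log(1.0 - su.sum() / N) / log_rho      # estimate of |u|
    n_v = np.log(1.0 - sv.sum() / N) / log_rho      # estimate of |v|
    return n_u + n_v - 2.0 * estimate_inner_product(su, sv, N)
```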

III-C Jaccard similarity

The Jaccard similarity between a pair of binary vectors u and v can be computed from their Hamming distance and their inner product:
JS(u, v) = ⟨u, v⟩ / ( ⟨u, v⟩ + HD(u, v) ).

This paves the way for an algorithm to compute the Jaccard similarity from BinSketch sketches.

Input: Sketches u' of u and v' of v

1: Calculate the approximation of ⟨u, v⟩ using Algorithm 1
2: Calculate the approximation of HD(u, v) using Algorithm 2
3: return the approximation of JS(u, v) as the ratio of the approximate ⟨u, v⟩ to the sum of the approximate ⟨u, v⟩ and the approximate HD(u, v)
Algorithm 3: BinSketch estimation of JS(u, v)
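Algorithm 3 then simply combines the two previous estimators; a short illustrative version, built on the helpers sketched earlier:

```python
# An illustrative rendering of Algorithm 3: JS = <u, v> / (<u, v> + HD(u, v)),
# with both quantities replaced by their estimates from Algorithms 1 and 2.
def estimate_jaccard(su, sv, N):
    ip = estimate_inner_product(su, sv, N)
    hd = estimate_hamming(su, sv, N)
    return ip / (ip + hd)
```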

III-D Cosine similarity

The Cosine similarity between a pair of binary vectors u and v is defined as
Cos(u, v) = ⟨u, v⟩ / √(|u| · |v|).

An algorithm for estimating the Cosine similarity from the binary sketches is straightforward to design at this point.

Input: Sketches u' of u and v' of v

1: Calculate n_u, n_v, and the approximation of ⟨u, v⟩ as done in Algorithm 1
2: return the approximation of Cos(u, v) as (approximation of ⟨u, v⟩) / √(n_u · n_v)
Algorithm 4: BinSketch estimation of Cos(u, v)
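Similarly, an illustrative version of Algorithm 4 divides the estimated inner product by the geometric mean of the estimated set sizes:

```python
# An illustrative rendering of Algorithm 4: Cos = <u, v> / sqrt(|u| |v|),
# again reusing estimate_inner_product from above.
import numpy as np

def estimate_cosine(su, sv, N):
    log_rho = np.log(1.0 - 1.0 / N)
    n_u = np.log(1.0 - su.sum() / N) / log_rho
    n_v = np.log(1.0 - sv.sum() / N) / log_rho
    return estimate_inner_product(su, sv, N) / np.sqrt(n_u * n_v)
```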

It should be possible to prove that Algorithms 2, 3, and 4 are accurate and low-error estimators of the Hamming distance, Jaccard similarity, and Cosine similarity, respectively; however, those analyses are left out of this paper.

IV Experiments

Fig. 1: Comparison of the MSE measure for Inner Product on the NYTimes dataset. A lower value is an indication of better performance.

Hardware description

We performed our experiments on a machine having the following configuration: CPU: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz x 4; Memory: 7.5 GB; OS: Ubuntu 18.04; Model: Lenovo Thinkpad T430.

To reduce the effect of randomness, we repeated each experiment several times and took the average. Our implementations did not employ any special optimization.

Datasets

The experiments were performed on publicly available datasets, namely, NYTimes news articles, Enron emails, and KOS blog entries from the UCI machine learning repository [18], and the BBC News dataset [13]. We considered the entire corpus of the KOS and BBC News datasets, while for the NYTimes and Enron datasets we sampled a subset of the data points.

Competing Algorithms

For our experiments we have used three similarity measures: Jaccard similarity, Cosine similarity, and Inner Product. For Jaccard similarity, MinHash [3], Densified One Permutation Hashing (DOPH) [25] – a faster variant of MinHash – BCS [22], and Odd Sketch [20] were the competing algorithms. Odd Sketch is two-step in nature: it takes the sketch obtained by running MinHash on the original data as input and outputs a binary sketch that maintains an estimate of the original Jaccard similarity. As suggested by its authors, we set its number of permutations as a function of the similarity threshold, and we upper bound it by the maximum number of permutations used by MinHash. For Cosine similarity, SimHash [9], Circulant Binary Embedding (CBE) [27] – a faster variant of SimHash – MinHash [23], and DOPH [25] used in the algorithm of [23] instead of MinHash, were the competing algorithms. For the Inner Product, BCS [22], Asymmetric MinHash [24], and Asymmetric DOPH (DOPH [25] used in the algorithm of [24]) were the competing algorithms.

IV-A Experiment 1: Accuracy of Estimation

In this task, we evaluate the fidelity of the estimates produced by BinSketch across various similarity regimes.

Evaluation Metric

To understand the behavior of BinSketch across various similarity regimes, we extract similar pairs – pairs of data objects whose similarity is higher than a certain threshold – from the datasets. We used Cosine, Jaccard, and Inner Product as our measures. For example, in the Jaccard/Cosine case, for a given threshold we considered only those pairs whose similarity is higher than that threshold. We used the mean square error (MSE) as our evaluation criterion. Using BinSketch and the other candidate algorithms, we compressed the datasets to various values of the compression length N, and then calculated the MSE of every algorithm for each value of N. For example, in order to calculate the MSE of BinSketch with respect to the ground truth, for every pair of data points we calculated the square of the difference between the similarity estimated from the BinSketch sketches and the corresponding ground-truth similarity, summed these values over all such pairs, and took the mean. For the Inner Product we report this value directly, while for Jaccard/Cosine similarity we report its negative logarithm; a smaller MSE corresponds to a larger negative-log value, so there a higher value indicates better performance.
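The MSE computation itself is straightforward; the snippet below is an illustrative sketch in which est_sim and true_sim are placeholder callables standing for an algorithm's similarity estimate and the ground-truth similarity.

```python
# An illustrative sketch of the MSE evaluation for one algorithm and one
# compression length; est_sim and true_sim are placeholder callables.
import numpy as np

def mse(similar_pairs, est_sim, true_sim):
    errors = [(est_sim(u, v) - true_sim(u, v)) ** 2 for u, v in similar_pairs]
    return float(np.mean(errors))

# For Jaccard/Cosine the paper reports the negative logarithm of this value,
# so a higher reported number means a smaller error.
```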

Insights

We summarize our results in Figures 2 and 1 for Cosine/Jaccard similarity and Inner Product, respectively. For Cosine similarity, BinSketch consistently remains better than the other candidates. For Jaccard similarity, it significantly outperforms most of the other candidates, while its performance is comparable with respect to the remaining one. Moreover, for the Inner Product results (Figure 1), BinSketch significantly outperforms the competing algorithms. (We observed a similar pattern in both the MSE and the Ranking experiments on the other datasets and similarity measures as well; we defer those plots to the full version of the paper.)

IV-B Experiment 2: Ranking

Evaluation Metric

In this experiment, given a dataset and a set of query points, the aim is to find all the points that are similar to the query points under a given similarity measure. To do so, we randomly partition the dataset into two parts. The bigger partition is called the training partition, while the smaller one is called the querying partition, and we call each vector of the querying partition a query vector. For each query vector, we compute the points in the training partition whose similarities are higher than a certain threshold. For Cosine and Jaccard similarity, we used a fixed set of threshold values; for the Inner Product, we first found the maximum Inner Product existing in the dataset and then set the thresholds accordingly. For every query point, we first find all the similar points in the uncompressed dataset, which we call the ground-truth result. We then compress the dataset using the candidate algorithms for various values of the compression length. To evaluate the performance of the competing algorithms, we used accuracy, precision, recall, and F1 score as our standard measures. If the set G denotes the ground-truth result (the result on the uncompressed dataset) and the set R denotes the result on the compressed dataset, then precision = |G ∩ R| / |R|, recall = |G ∩ R| / |G|, and the accuracy and F1 score are computed from G and R in the standard manner.
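These per-query measures can be computed directly from the two sets; a minimal illustration, with G the ground-truth set and R the retrieved set:

```python
# An illustrative computation of precision, recall, and F1 score for one query.
def precision_recall_f1(G, R):
    tp = len(G & R)
    precision = tp / len(R) if R else 0.0
    recall = tp / len(G) if G else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1({1, 2, 3, 4}, {2, 3, 5}))   # -> (0.666..., 0.5, 0.571...)
```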

Fig. 2: Comparison of the (negative-log) MSE measure on the Enron, NYTimes, and BBC datasets. A higher value is an indication of better performance.
Fig. 3: Comparison of compression times on the NYTimes, Enron, KOS, and BBC datasets.
Fig. 4: Comparison of the Accuracy and F1 score measures on the Enron, NYTimes, and KOS datasets.

Insights

We summarize the Accuracy and F1-score results in Figure 4. For Jaccard similarity, on both the Accuracy and the F1-score measures, BinSketch significantly outperformed most of the other candidates, while its performance was comparable with respect to the remaining one. For Cosine similarity, on the higher and intermediate threshold values BinSketch outperformed all the other candidate algorithms, while on the lower threshold values one of the baselines offered the most accurate sketch, followed by BinSketch.

Efficiency of BinSketch

We now compare the efficiency of BinSketch with that of the other competing algorithms and summarize our results in Figure 3. We noted the time required to compress the original dataset using each of the competing algorithms. For a given compression length, the compression time of Odd Sketch varies with the similarity threshold; therefore, we report its average. We notice that the time required by BinSketch and BCS is negligible for all values of N and on all the datasets. The compression time of one of the baselines is much higher than ours, although it is independent of the compression length N. Beyond some initial compression lengths, the compression time of Odd Sketch is the highest and grows linearly with N, as it requires running MinHash on the original dataset. For the remaining algorithms, the respective compression times grow linearly with N.

V Summary and open questions

In this work, we proposed a simple dimensionality reduction algorithm – BinSketch – for sparse binary data. BinSketch is an efficient dimensionality reduction/sketching algorithm that compresses a given d-dimensional binary dataset to a much smaller N-dimensional binary sketch while simultaneously maintaining estimates for multiple similarity measures such as Jaccard similarity, Cosine similarity, Inner Product, and Hamming distance on the same sketch. The performance of BinSketch was significantly better than that of BCS [21, 22], while the compression (dimensionality reduction) times of the two algorithms were comparable. BinSketch obtained a significant speedup in compression time w.r.t. the other candidate algorithms (MinHash [3, 23], SimHash [9], DOPH [25], CBE [27]) while simultaneously offering a comparable performance guarantee.

We want to highlight that the error bound presented in Theorem 1 comes from a worst-case analysis, which can potentially be tightened; we state this as an open question of the paper. Our experiments on real datasets support this: for example, for the Inner Product (see Figure 1), the mean square error is almost zero even for compressed dimensions that are much smaller than the bound stated in the theorem. Another important open question is to derive a lower bound on the size of a sketch that is required to efficiently and accurately derive similarity values from compressed sketches. Given the simplicity of our method, we hope that it will get adopted in practice.

References

  • [1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford (2014) A reliable effective terascale linear learning system. Journal of Machine Learning Research 15, pp. 1111–1133. External Links: Link Cited by: §I.
  • [2] R. J. Bayardo, Y. Ma, and R. Srikant (2007) Scaling up all pairs similarity search. See DBLP:conf/www/2007, pp. 131–140. External Links: Link, Document Cited by: §I-C.
  • [3] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher (1998) Min-wise independent permutations (extended abstract). See DBLP:conf/stoc/1998, pp. 327–336. External Links: Link, Document Cited by: §I-B, TABLE I, §II, §II, §IV, §V, Definition 2.
  • [4] A. Z. Broder and M. Mitzenmacher (2003) Survey: network applications of bloom filters: A survey. Internet Mathematics 1 (4), pp. 485–509. External Links: Link, Document Cited by: §I-B.
  • [5] A. Z. Broder (2000) Identifying and filtering near-duplicate documents. See DBLP:conf/cpm/2000, pp. 1–10. External Links: Link, Document Cited by: §I-C, §II.
  • [6] A. Z. Broder (1997) On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pp. 21–29. Cited by: §I-C.
  • [7] S. Buyrukbilen and S. Bakiras (2013) Secure similar document detection with simhash. See DBLP:conf/sdmw/2013, pp. 61–75. External Links: Link, Document Cited by: §I-C.
  • [8] V. T. Chakaravarthy, V. Pandit, and Y. Sabharwal (2009) Analysis of sampling techniques for association rule mining. See DBLP:conf/icdt/2009, pp. 276–283. External Links: Link, Document Cited by: §I-C.
  • [9] M. Charikar (2002) Similarity estimation techniques from rounding algorithms. See DBLP:conf/stoc/2002, pp. 380–388. External Links: Link, Document Cited by: §I-B, TABLE I, §II, §II, §IV, §V.
  • [10] F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan (2009) On compressing social networks. See DBLP:conf/kdd/2009, pp. 219–228. External Links: Link, Document Cited by: §I-C.
  • [11] I. S. Dhillon and D. S. Modha (2001) Concept decompositions for large sparse text data using clustering. Machine Learning 42 (1/2), pp. 143–175. External Links: Link, Document Cited by: §I-C.
  • [12] M. X. Goemans and D. P. Williamson (1995) Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42 (6), pp. 1115–1145. External Links: Link, Document Cited by: §II, §II.
  • [13] D. Greene and P. Cunningham (2006) Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proc. 23rd International Conference on Machine learning (ICML’06), pp. 377–384. Cited by: §IV.
  • [14] M. R. Henzinger (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. See DBLP:conf/sigir/2006, pp. 284–291. External Links: Link, Document Cited by: §I-C.
  • [15] Z. Huang (1998-09-01) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2 (3), pp. 283–304. External Links: ISSN 1573-756X, Document, Link Cited by: §I-C.
  • [16] Q. Jiang and M. Sun (2011) Semi-supervised simhash for efficient document similarity search. See DBLP:conf/acl/2011, pp. 93–101. External Links: Link Cited by: §I-C.
  • [17] Y. Jiang, C. Ngo, and J. Yang (2007) Towards optimal bag-of-features for object categorization and semantic video retrieval. See DBLP:conf/civr/2007, pp. 494–501. External Links: Link, Document Cited by: §I.
  • [18] M. Lichman (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §IV.
  • [19] G. S. Manku, A. Jain, and A. D. Sarma (2007) Detecting near-duplicates for web crawling. See DBLP:conf/www/2007, pp. 141–150. External Links: Link, Document Cited by: §I-C.
  • [20] M. Mitzenmacher, R. Pagh, and N. Pham (2014) Efficient estimation for high similarities using odd sketches. See DBLP:conf/www/2014, pp. 109–118. External Links: Link, Document Cited by: §I-B, TABLE I, §IV.
  • [21] R. Pratap, R. Kulkarni, and I. Sohony (2018) Efficient dimensionality reduction for sparse binary data. See DBLP:conf/bigdataconf/2018, pp. 152–157. External Links: Link, Document Cited by: §I-B, §I-B, TABLE I, §II, §V.
  • [22] R. Pratap, I. Sohony, and R. Kulkarni (2018) Efficient compression technique for sparse sets. See DBLP:conf/pakdd/2018-3, pp. 164–176. External Links: Link, Document Cited by: §I-B, §I-B, TABLE I, §II, §IV, §V.
  • [23] A. Shrivastava and P. Li (2014) In defense of minhash over simhash. See DBLP:conf/aistats/2014, pp. 886–894. External Links: Link Cited by: §II, §IV, §V.
  • [24] A. Shrivastava and P. Li (2015) Asymmetric minwise hashing for indexing binary inner products and set containment. See DBLP:conf/www/2015, pp. 981–991. External Links: Link, Document Cited by: §I-B, §I-C, §IV.
  • [25] A. Shrivastava (2017) Optimal densification for fast and accurate minwise hashing. See DBLP:conf/icml/2017, pp. 3154–3163. External Links: Link Cited by: §I-B, §I-B, TABLE I, §IV, §V.
  • [26] Y. Singer. Sibyl: a system for large scale machine learning. Technical report. Cited by: §I.
  • [27] F. X. Yu, S. Kumar, Y. Gong, and S. Chang (2014) Circulant binary embedding. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pp. II–946–II–954. External Links: Link Cited by: §I-B, TABLE I, §IV, §V.

Appendix A Proof of Observation 11

In this section we prove that . For this first we derive an upper bound of on .

Let denote the expression appearing in Lemma 6. Using this lemma, . Observe that since and . Furthermore, since , we get the upper bound . For reasonable values of and , both and are at least 4; thus, we get the bound of and this leads us to the bound .

Appendix B Proof of Lemma 12

In this section we derive an upper bound on

Proof.

We first apply triangle inequality and Lemma 8 to obtain

Next we derive an upper bound for the last term for which we require the next few observations. Let denote , denote , and denote .

Observation 13.

By expanding and employing , we obtain that since .

Observation 14.

Using Lemma 5, for non-zero and .

These observations ensure that the terms inside the logarithm are indeed positive.

Next we upper bound by employing the inequality that holds for non-negative and can be derived from the standard inequality for . Here, set and . Then, using triangle inequality

Observation 15.

These claims appear in the proof of Lemma 8: and . Similarly, and .

We need one final observation to compute .

Observation 16.

Using Lemma 7, .

Based on the last two observations we can compute