Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines

Cun (Matthew) Mu, Jet.com/Walmart Labs, Hoboken, NJ
Jun (Raymond) Zhao, Jet.com/Walmart Labs, Hoboken, NJ
Guang Yang, Jet.com/Walmart Labs, Hoboken, NJ
Binwei Yang, Walmart Labs, Sunnyvale, CA (BYang@walmartlabs.com)
Zheng (John) Yan, Jet.com/Walmart Labs, Hoboken, NJ

{matthew.mu, raymond, guang, john}@jet.com

Abstract

Both academia and industry have recently shown growing interest in building nearest neighbor search (NNS) solutions on top of full-text search engines. Compared with other NNS systems, such solutions effectively reduce main memory consumption, coherently support multi-modal search and are immediately ready for production deployment. In this paper, we continue the journey to explore specifically how to empower full-text search engines with fast and exact NNS in Hamming space (i.e., the set of binary codes). By revisiting three techniques (bit operation, sub-code filtering and data preprocessing with permutation) from the information retrieval literature, we develop a novel engineering solution for full-text search engines to efficiently accomplish this special but important NNS task. In the experiment, we show that our proposed approach enables full-text search engines to achieve significant speed-ups over their state-of-the-art term match approach for NNS within binary codes.

Keywords: Full-text search engine; Nearest neighbor search; Hamming space; Semantic binary embedding; Elasticsearch; Lucene

1 Introduction

Full-text search engines, based on first-order document-term statistics such as TF-IDF and Okapi BM25, have been deployed ubiquitously in today's web applications to help customers find textual documents that match their specified keywords.

Recently, active efforts from both academia and industry [12, 19, 18, 1, 14] have been witnessed to empower full-text search engines with the capability of nearest neighbor search (NNS). Compared with other NNS solutions (e.g., Annoy [3], FLANN [15], FAISS [7] and SPTAG [6]), such full-text search engine based ones have a number of clear advantages.

Implemented in secondary memory.

As demonstrated by Amato et al. (2018), unlike other NNS solutions implemented in main memory, NNS systems established on full-text search engines substantially reduce main-memory consumption, thanks to the highly optimized disk-based index mechanics behind full-text search engines. This makes such systems more cost-effective and thus more suitable for big-data applications.

Flexible in multi-modal search.

As highlighted by Mu et al. (2018), enabling full-text search engines with NNS paves a coherent way for multi-modal searches (e.g., allowing users to express their interests in both visual and textual queries; see Fig. 1 for an application of multi-modal search in eCommerce), at which most other NNS systems fall short.

Figure 1: Illustration of multi-modal search. Full-text search engines, empowered with nearest neighbor search, allow customers to express their interests simultaneously in image queries (whose visual feature vectors will be consumed in NNS to find visually similar products) and textual queries (e.g., color, brand and price range). As a result, products retrieved will not only be visually similar to uploaded images but also satisfy requested keywords from customers.

Ready for production.

Last but not least, as emphasized by Rygl et al. (2017), NNS systems built upon full-text search engines are extremely well-prepared for production deployment. Due to the cutting-edge engineering designs from full-text search engines (e.g., Elasticsearch and Solr), important features like horizontal scaling, I/O and cache optimization, security configuration, index and cluster management, real-time monitoring and RESTful APIs are immediately ready to be consumed by such NNS systems, so that engineers can effectively avoid reinventing the wheel themselves.

Blessed by all the above distinct benefits, we continue this journey to explore specifically effective ways to achieve exact nearest neighbor search in Hamming space (i.e., the set of binary codes) on top of full-text search engines.

Problem statement.

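Specifically, given the following dataset of $n$ binary codes, each of $m$ bits,

$$\mathcal{B} := \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n\} \subseteq \{0,1\}^m, \tag{1}$$

the goal of our paper is to enable full-text search engines with the capability of efficiently finding all $r$-neighbors of a query code $\mathbf{q} \in \{0,1\}^m$ in $\mathcal{B}$, namely

$$B_H(\mathbf{q}, r) := \{\mathbf{b} \in \mathcal{B} \mid d_H(\mathbf{b}, \mathbf{q}) \le r\}, \tag{2}$$

where $d_H(\mathbf{b}, \mathbf{q})$ denotes the Hamming distance between binary codes $\mathbf{b}$ and $\mathbf{q}$. (Footnote 1: It is worth noting that the $r$-neighbor search problem studied by the paper can also be easily adapted to conduct $k$-NN ($k$-nearest neighbors) search by progressively increasing the Hamming search radius $r$ until $k$ neighbors are found.)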
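To make the problem statement and the footnote concrete, here is a minimal Python sketch of exhaustive r-neighbor search and of k-NN search by progressively enlarging the radius. It is only a reference baseline under simple assumptions (codes held as Python integers in memory); the function names are illustrative and are not part of the paper or of Elasticsearch.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes stored as Python ints."""
    return bin(a ^ b).count("1")


def r_neighbors(codes: List[int], q: int, r: int) -> List[int]:
    """Exhaustive r-neighbor search: indices of all codes within distance r of q."""
    return [i for i, b in enumerate(codes) if hamming(b, q) <= r]


def knn_by_growing_radius(codes: List[int], q: int, k: int, m: int) -> List[int]:
    """k-NN via r-neighbor search: enlarge r until at least k neighbors are found."""
    for r in range(m + 1):
        hits = r_neighbors(codes, q, r)
        if len(hits) >= k:
            # Ties at the final radius may yield more than k hits; keep the k closest.
            return sorted(hits, key=lambda i: hamming(codes[i], q))[:k]
    return list(range(len(codes)))  # the dataset holds fewer than k codes


codes = [0b0000, 0b0001, 0b0111, 0b1111]
print(r_neighbors(codes, 0b0000, 1))               # codes within radius 1 of the query
print(knn_by_growing_radius(codes, 0b0000, 3, 4))  # three nearest codes
```

Each radius step repeats a full scan here; the techniques in Section 3 are what make each r-neighbor query cheap enough for this loop to be practical.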

Why binary codes?

Finding nearest neighbors in Hamming space is an extremely important subclass of NNS, as learning and representing textual or visual data with compact and semantic binary vectors is a mature technology and common practice in today's information retrieval. Using well-trained binary vectors instead of floating-point ones enables dramatic reductions in storage and communication costs without much sacrifice in search accuracy. In particular, eBay builds its whole visual search system [20] by finding nearest neighbors within binary codes extracted from catalog and query images through the supervised semantic-preserving deep hashing (SSDH) model [11].

What is missing?

However, most of the aforementioned NNS solutions based on full-text search engines [19, 18, 1, 14] fail to deal with binary codes, as they find nearest neighbors by generating and indexing surrogate textual tokens based merely on information collected from the few largest-magnitude entries of each floating-point vector. The only exception is the term match approach developed by Lux and Marques (2013) in their Java library called LIRE (Lucene Image Retrieval). The core idea is to calculate the Hamming distance between two binary codes by matching their bits at each position. On the one hand, this term match approach is a natural way to leverage full-text search engines to conduct NNS; on the other hand, it largely overlooks the intrinsic and special properties of binary codes by treating binary digits simply as text. Motivated by this, we revisit three pervasive techniques from the information retrieval literature: bit operation, sub-code filtering and data preprocessing with permutation. By integrating these three techniques seamlessly into full-text search engines, we end up with a solution that outperforms the term match one dramatically in terms of search latency.

Elasticsearch.

Elasticsearch (ES), built upon Apache Lucene, is an open-source, real-time, distributed and multi-tenant full-text search engine. Since its first release in Feb. 2010, it has become the most popular enterprise search engine and has been widely adopted by a variety of companies (e.g., eBay, Facebook, GitHub, Lyft and Shopify) for either internal or external uses to discover relevant documents. Similar to previous works [19, 18, 1, 14], and without loss of general applicability to other full-text search engines, we elaborate our core ideas concretely on the platform of Elasticsearch. We also describe our first-hand experience in implementing this ES-based solution in detail, which should help practitioners understand and replicate our approach.

Organization.

The rest of the paper is organized as follows. In Section 2, we first review the term match approach widely implemented on full-text search engines to find nearest neighbors among binary codes. In Section 3, we propose a better one for full-text search engines to accomplish this task. Specifically, we implement an Elasticsearch-based solution called FENSHSES (Fast Exact Neighbor Search in Hamming Space on Elasticsearch) to conduct nearest neighbor search in Hamming space. We incorporate three techniques into FENSHSES: bit operation, which enables Elasticsearch to compute Hamming distance with just a few bit operations; sub-code filtering, which instructs Elasticsearch to conduct a simple but effective screening process before any Hamming distance calculation and therefore empowers FENSHSES with sub-linear search times; and data preprocessing with permutation, which preprocesses binary codes with an appropriate permutation to maximize the effect of sub-code filtering. In Section 4, we show that FENSHSES outperforms the term match approach dramatically in terms of search latency.

2 Term Match from LIRE

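Based on its definition, Hamming distance is simply the number of positions at which two binary codes differ. As a result, full-text search engines can naturally compute it through term match. Specifically, for each binary code $\mathbf{b} \in \mathcal{B}$, we can index the positions of its ones and zeros, respectively, i.e.,

$$I_{\mathbf{b}} := \{i \in [m] \mid \text{the } i\text{-th bit of } \mathbf{b} \text{ is } 1\}, \qquad O_{\mathbf{b}} := \{i \in [m] \mid \text{the } i\text{-th bit of } \mathbf{b} \text{ is } 0\},$$

with $[m] := \{1, 2, \ldots, m\}$. When the query binary code $\mathbf{q}$ arrives, Elasticsearch can simply calculate its Hamming distance to each binary code $\mathbf{b} \in \mathcal{B}$ by matching its one and zero positions with those of $\mathbf{b}$:

$$d_H(\mathbf{b}, \mathbf{q}) = m - |I_{\mathbf{b}} \cap I_{\mathbf{q}}| - |O_{\mathbf{b}} \cap O_{\mathbf{q}}|. \tag{3}$$

The JSON-encoded request body for Elasticsearch to find $B_H(\mathbf{q}, r)$ by computing (3) is illustrated in JSON 2, where we denote $I_{\mathbf{q}} = \{\texttt{Iq[1]}, \ldots, \texttt{Iq[u]}\}$ and $O_{\mathbf{q}} = \{\texttt{Oq[1]}, \ldots, \texttt{Oq[v]}\}$. (Footnote 2: All ES-related implementations are based on Elasticsearch 6.1.0.)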

This term match approach, first introduced in the Java library LIRE [12] to find visually similar images (based on their binary visual features), is currently the cutting-edge approach for full-text search engines to find nearest neighbors within binary codes. Some of its variants (e.g., using fuzzy queries based on Levenshtein edit distance) are also widely used on full-text search engines today.

```json
{
  "min_score": m-r,
  "query": {
    "function_score": {
      "functions": [
        {"filter": {"term": {"Ib": Iq[1]}}, "weight": 1},
        …,
        {"filter": {"term": {"Ib": Iq[u]}}, "weight": 1},
        {"filter": {"term": {"Ob": Oq[1]}}, "weight": 1},
        …,
        {"filter": {"term": {"Ob": Oq[v]}}, "weight": 1}
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}
```

JSON 2: ES request body of the term match approach to $r$-neighbor search in Hamming space.
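The scoring logic behind this request can be mimicked outside Elasticsearch. The following Python sketch is only an illustration of identity (3): it builds the position sets for ones and zeros, counts matching terms as the summed function score would, and recovers the Hamming distance. Bit strings and the helper names are assumptions made for readability, not the LIRE implementation.

```python
def positions(code: str):
    """Split a bit string into the positions of its ones (I) and zeros (O)."""
    ones = {i for i, bit in enumerate(code) if bit == "1"}
    zeros = {i for i, bit in enumerate(code) if bit == "0"}
    return ones, zeros


def term_match_distance(b: str, q: str) -> int:
    """Hamming distance as code length minus the number of matching position terms."""
    I_b, O_b = positions(b)
    I_q, O_q = positions(q)
    matches = len(I_b & I_q) + len(O_b & O_q)  # what the summed weights count
    return len(b) - matches


b, q = "10110", "10011"
print(term_match_distance(b, q))  # number of positions where b and q differ
```

Because every bit position becomes one term clause, the request grows linearly with the code length, which is one reason the bit-operation variant in Section 3.1 is cheaper.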

3 Proposed Approach: FENSHSES

The term match approach treats each binary digit (i.e., bit) in a textual way, which largely overlooks the intrinsic and special properties of binary codes. By making better use of these properties, we introduce a novel approach called FENSHSES (Fast Exact Neighbor Search in Hamming Space on Elasticsearch), whose complete JSON-encoded ES request body can be found in JSON 3.4. In essence, FENSHSES integrates three techniques: bit operation, sub-code filtering and data preprocessing with permutation, which should be generally applicable to other full-text search engines besides Elasticsearch. These three techniques are pervasively used in nearest neighbor search for binary codes; but as far as we know, this is the first time such techniques are seamlessly integrated into full-text search engines, leading to a novel NNS solution with minimal main memory consumption, full support for multi-modal search and extreme readiness for production deployment (per our discussion in Section 1).

3.1 Bit Operation

Motivated by the well-known fact that Hamming distances between binary codes can be computed extremely fast using bit operations, in this part we explore how to replace term match by natively empowering Elasticsearch to calculate Hamming distances through bit operations.

For an $m$-bit binary code $\mathbf{b}$, we first segment it into $s$ sub-codes (Footnote 3: For simplicity, we assume $s$ divides $m$.):

$$\mathbf{b} = [\mathbf{b}^{(1)}, \mathbf{b}^{(2)}, \ldots, \mathbf{b}^{(s)}], \qquad \mathbf{b}^{(k)} \in \{0,1\}^{m/s}. \tag{4}$$

Since $d_H(\mathbf{b}, \mathbf{q}) = \sum_{k=1}^{s} d_H(\mathbf{b}^{(k)}, \mathbf{q}^{(k)})$, the Hamming distance calculation is reduced to $s$ calculations on binary codes of much shorter length. We re-implement the assembly code found in the notable HAKMEM memo [5], which computes the Hamming distance between two short binary codes of length 64 or less, in Painless, a simple and secure scripting language designed specifically for Elasticsearch. When the query binary code $\mathbf{q}$ is issued, we invoke hmd64bit $s$ times to calculate each $d_H(\mathbf{b}^{(k)}, \mathbf{q}^{(k)})$, specifying the field holding $\mathbf{b}^{(k)}$ and the sub-code $\mathbf{q}^{(k)}$ as parameters accordingly, and then sum the results. The whole process can be efficiently implemented in ES using the function score query, where several functions are combined to calculate the score of each document (see lines 15-31 in JSON 3.4).

```json
POST _scripts/hmd64bit
{
  "script": {
    "lang": "painless",
    "source": """
      long u = params.subcode^doc[params.field].value;
      long uCount = u-((u>>>1)&-5270498306774157605L)
                     -((u>>>2)&-7905747460161236407L);
      return ((uCount+(uCount>>>3))&8198552921648689607L)%63
    """
  }
}
```

JSON 3.1: Create the script called hmd64bit in Elasticsearch through the _scripts endpoint.
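As a sanity check on the bit trick, the sketch below ports the HAKMEM-style population count to Python and sums it over sub-codes. It is a rough illustration, not the Painless script itself: Python integers are unbounded, so the three masks are written as unsigned 64-bit hexadecimal constants (the same bit patterns as the signed decimal literals), and the final modulo-63 folding step is assumed as in the classic HAKMEM routine. The helper `hamming_by_subcodes` is a hypothetical name.

```python
MASK64 = (1 << 64) - 1
M3 = 0xB6DB6DB6DB6DB6DB  # bit pattern ...011011; -5270498306774157605 as a signed long
M1 = 0x9249249249249249  # bit pattern ...001001; -7905747460161236407 as a signed long
M7 = 0x71C71C71C71C71C7  # bit pattern ...000111; 8198552921648689607


def hmd64bit(subcode: int, stored: int) -> int:
    """Hamming distance between two sub-codes of at most 64 bits."""
    u = (subcode ^ stored) & MASK64            # differing bits
    # Count the ones inside every 3-bit group in parallel.
    u_count = u - ((u >> 1) & M3) - ((u >> 2) & M1)
    # Merge neighboring groups into 6-bit fields, then fold the fields with mod 63.
    return ((u_count + (u_count >> 3)) & M7) % 63


def hamming_by_subcodes(b_subcodes, q_subcodes) -> int:
    """Full Hamming distance as the sum of sub-code distances."""
    return sum(hmd64bit(qk, bk) for bk, qk in zip(b_subcodes, q_subcodes))


print(hmd64bit(0b1011, 0b0001))                           # two differing bits
print(hamming_by_subcodes([0xFF, 0x0F], [0x0F, 0x0F]))    # 4 + 0
```

One caveat of the modulo-63 step: two 64-bit sub-codes that differ in 63 or 64 positions wrap around to 0 or 1. Such distances lie far outside any practical search radius, but the routine is only exact for sub-code distances up to 62.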

3.2 Sub-Code Filtering

So far, whether we use the term match approach or the bit operation one, we have to exhaustively compute the Hamming distance between $\mathbf{q}$ and each binary code in $\mathcal{B}$. This expensive linear scan is undesirable for many applications where the number of codes in $\mathcal{B}$ is on the order of millions or even billions [20]. As a remedy, in this part, we borrow a simple but powerful counting argument from Norouzi et al. (2012) to conduct a screening process before any Hamming distance calculation, which empowers our FENSHSES approach with sub-linear search times.

Suppose binary codes are segmented into $s$ sub-codes as in (4). Then for two codes $\mathbf{b}$ and $\mathbf{q}$ within Hamming distance $r$, among all their sub-code pairs $\{(\mathbf{b}^{(k)}, \mathbf{q}^{(k)})\}_{k=1}^{s}$, there must be at least one pair with Hamming distance no larger than $\lfloor r/s \rfloor$, which mathematically implies

$$B_H(\mathbf{q}, r) \subseteq \bigcup_{k=1}^{s} \left\{\mathbf{b} \in \mathcal{B} \mid d_H(\mathbf{b}^{(k)}, \mathbf{q}^{(k)}) \le \lfloor r/s \rfloor\right\}. \tag{5}$$

This simple counting argument has great potential for reducing the number of Hamming distance calculations needed to find all $r$-neighbors in $\mathcal{B}$. Specifically, according to relationship (5), it is safe to consider only the binary codes belonging to the set on the right side of (5), whose size could be substantially smaller than $|\mathcal{B}|$ when $r$ is small. It is worth noting that similar ideas have been frequently revisited in many different contexts (e.g., multi-index hashing [16] and string similarity joins [10]), and a generalized version of (5) has also been derived recently [17].

Thanks to the inverted-index nature of full-text search engines, this sub-code filtering step is straightforward to implement on them. Specifically, on Elasticsearch, we can simply leverage the filter context (see lines 8-14 in JSON 3.4), within which each sub-code Hamming ball is obtained by a terms query (e.g., line 11 in JSON 3.4), and the union is achieved through a boolean combination of should clauses.
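The filter-then-verify idea can be sketched in a few lines of Python. This is an in-memory illustration only: the loop over `codes` stands in for the inverted-index lookup that Elasticsearch performs, and the helper names (`hamming_ball`, `split`, `filtered_search`) are assumptions introduced here. The ball radius is taken as the integer part of r divided by the number of sub-codes, following the counting argument above.

```python
from itertools import combinations


def hamming_ball(subcode: int, length: int, radius: int):
    """All `length`-bit integers within Hamming distance `radius` of `subcode`."""
    ball = []
    for d in range(radius + 1):
        for flips in combinations(range(length), d):
            mask = 0
            for pos in flips:
                mask |= 1 << pos
            ball.append(subcode ^ mask)
    return ball


def split(code: int, m: int, s: int):
    """Segment an m-bit integer into s sub-codes of m // s bits, most significant first."""
    w = m // s
    return [(code >> (w * (s - 1 - k))) & ((1 << w) - 1) for k in range(s)]


def filtered_search(codes, q, r, m, s):
    """Sub-code filtering followed by exact verification of the survivors."""
    w = m // s
    balls = [set(hamming_ball(qk, w, r // s)) for qk in split(q, m, s)]
    hits = []
    for i, b in enumerate(codes):
        # Filter: keep b only if some sub-code falls inside the matching ball.
        if any(bk in ball for bk, ball in zip(split(b, m, s), balls)):
            # Verify: exact Hamming distance on the surviving candidate.
            if bin(b ^ q).count("1") <= r:
                hits.append(i)
    return hits


codes = [0b0000000000000000, 0b0000000100000001, 0b1111000011110000]
print(filtered_search(codes, 0b0000000000000001, 2, 16, 4))
```

The values returned by `hamming_ball` are what one would pass to each terms clause. Their number grows combinatorially with the ball radius, which is consistent with filtering being most effective for small search radii.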

3.3 Data Preprocessing with Permutation

The effectiveness of sub-code filtering will be maximized if the bits within the same sub-code group are statistically independent. Since Hamming distance is invariant to permutation of bit positions, it is tempting to transform the binary codes in $\mathcal{B}$ with an appropriate permutation towards this desired group independence property.

Two Bernoulli random variables $X$ and $Y$ are independent if and only if their correlation coefficient $\rho_{XY} = 0$. Therefore, it is natural to find a permutation $\pi$ that minimizes correlation effects within each sub-code segment. This immediately leads to the following optimization problem proposed by Wan et al. (2013) to improve the performance of [16]:

$$\min_{\pi} \; \left\langle \mathbf{D}, \; \mathbf{P}_\pi \mathbf{R} \mathbf{P}_\pi^\top \right\rangle. \tag{6}$$

Here $\pi$ ranges over the permutations of $[m]$, $\mathbf{D} \in \mathbb{R}^{m \times m}$ is a block diagonal matrix with $s$ diagonal blocks, each an $(m/s) \times (m/s)$ matrix of ones, $\mathbf{P}_\pi$ is the permutation matrix induced by $\pi$:

$$(\mathbf{P}_\pi)_{ij} = \begin{cases} 1 & \text{if } \pi(i) = j, \\ 0 & \text{otherwise,} \end{cases}$$

and $\mathbf{R}$ is a matrix in $\mathbb{R}^{m \times m}$ whose $(i,j)$-entry is obtained from $\mathcal{B}$ as the absolute value of the correlation between the $i$-th and the $j$-th bits.

Problem (6) is essentially an instance of the extensively studied balanced graph partition problem [8, 2, 9, 4]. In FENSHSES, we solve problem (6) by the well-known and scalable Kernighan-Lin algorithm [8], the gist of which is to find appropriate pairs $(i, j)$ and swap their mappings in $\pi$, i.e., exchange $\pi(i)$ and $\pi(j)$.

We leave it as future work to solve (6) by more recently developed approximation algorithms with better theoretical guarantees, e.g., the one based on semidefinite relaxation [9].
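The following Python sketch illustrates the preprocessing idea on a toy dataset. It is deliberately simplified: it uses a plain pairwise-swap local search that accepts any swap lowering the within-group sum of absolute correlations, which is a simplification and not the full Kernighan-Lin procedure used by the authors. All function names and the toy data are assumptions made for illustration; `perm[i]` holds the original bit index placed at position `i`.

```python
import random
from itertools import combinations


def abs_correlation(codes):
    """|Pearson correlation| between every pair of bit positions."""
    n, m = len(codes), len(codes[0])
    mean = [sum(c[i] for c in codes) / n for i in range(m)]
    R = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            cov = sum((c[i] - mean[i]) * (c[j] - mean[j]) for c in codes) / n
            var_i = mean[i] * (1 - mean[i])
            var_j = mean[j] * (1 - mean[j])
            if var_i > 0 and var_j > 0:  # constant bits are left at zero
                R[i][j] = abs(cov) / (var_i * var_j) ** 0.5
    return R


def within_group_cost(perm, R, s):
    """Sum of |correlation| over all bit pairs that share a sub-code group."""
    w = len(perm) // s
    cost = 0.0
    for g in range(s):
        group = perm[g * w:(g + 1) * w]
        cost += sum(R[a][b] for a, b in combinations(group, 2))
    return cost


def swap_search(R, s):
    """Greedy local search: keep any cross-group swap that lowers the cost."""
    m = len(R)
    perm, w = list(range(m)), m // s
    improved = True
    while improved:
        improved = False
        best = within_group_cost(perm, R, s)
        for i, j in combinations(range(m), 2):
            if i // w == j // w:
                continue  # swapping inside one group changes nothing
            perm[i], perm[j] = perm[j], perm[i]
            cost = within_group_cost(perm, R, s)
            if cost < best - 1e-12:
                best, improved = cost, True
            else:
                perm[i], perm[j] = perm[j], perm[i]  # undo the swap
    return perm


# Toy data: bits 0 and 1 are copies of each other, and so are bits 2 and 3.
rng = random.Random(0)
codes = []
for _ in range(200):
    a, c = rng.randint(0, 1), rng.randint(0, 1)
    codes.append([a, a, c, c])

R = abs_correlation(codes)
perm = swap_search(R, s=2)
print(within_group_cost([0, 1, 2, 3], R, 2), within_group_cost(perm, R, 2))
print(perm)
```

In this toy case the search should move the duplicated bits into different groups, so that no sub-code wastes its capacity on redundant positions. The cost evaluation is recomputed from scratch at every swap for clarity; an incremental update would be needed at realistic code lengths.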

3.4 Elasticsearch index for FENSHSES

For the proposed FENSHSES approach, the data to be indexed into the Elasticsearch cluster is minimal. For each binary code $\mathbf{b} \in \mathcal{B}$, we only need to index its sub-codes $\mathbf{b}^{(1)}, \ldots, \mathbf{b}^{(s)}$. (Footnote 4: More precisely, what is indexed are the integers represented by the binary sub-codes.) In JSON 3.4, we demonstrate how to index sub-codes together with other fields into one Elasticsearch index. Here we assume that bit operation and sub-code filtering segment binary codes in the same manner. If their segmentations differ, we recommend indexing two sets of sub-codes into the same Elasticsearch index by treating each set as a nested datatype.

Suppose the binary codes $\mathbf{b}$ are semantic visual embeddings from product images. With other product information (e.g., title, brand and price) indexed together with the sub-codes into one Elasticsearch index, we can easily achieve the multi-modal searches depicted in Figure 1 by adding filters (e.g., term filter and range filter) into JSON 3.4.

```json
PUT /fenshses
PUT /fenshses/default/_mapping
{
  "properties": {
    "title": {"type": "text"},
    "brand": {"type": "keyword"},
    "price": {"type": "double"},
    "is_in_stock": {"type": "boolean"},
    …,
    "b1": {"type": "long"},
    …,
    "bs": {"type": "long"}
  }
}
```

JSON 3.4 (index mapping): Request body to create the Elasticsearch index and define its mapping to support the FENSHSES approach.

```json
{
  "min_score": m-r,
  "query": {
    "function_score": {
      "query": {
        "constant_score": {
          "boost": m,
          "filter": {
            "bool": {
              "should": [
                {"terms": {"b1": ⌊r/s⌋-neighbors of q1}},
                …,
                {"terms": {"bs": ⌊r/s⌋-neighbors of qs}}
              ]}}}},
      "functions": [
        {"script_score": {
           "script": {
             "id": "hmd64bit",
             "params": {
               "field": "b1",
               "subcode": q1}}},
         "weight": -1},
        …,
        {"script_score": {
           "script": {
             "id": "hmd64bit",
             "params": {
               "field": "bs",
               "subcode": qs}}},
         "weight": -1}
      ],
      "boost_mode": "sum",
      "score_mode": "sum"}}}
```

JSON 3.4 (search request): Elasticsearch request body of the FENSHSES approach to $r$-neighbor search in Hamming space.
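For readers who want to generate this request programmatically, here is a hedged Python sketch that assembles the body as a dictionary. The JSON keys, the field names b1 to bs and the script id hmd64bit come from the listings in this paper; the function names, the helper `hamming_ball` and the use of 16-bit sub-codes in the example are assumptions for illustration. The sketch also assumes, as the text does, that bit operation and filtering share one segmentation.

```python
import json
from itertools import combinations


def hamming_ball(subcode: int, length: int, radius: int):
    """All `length`-bit integers within Hamming distance `radius` of `subcode`."""
    ball = []
    for d in range(radius + 1):
        for flips in combinations(range(length), d):
            mask = 0
            for pos in flips:
                mask |= 1 << pos
            ball.append(subcode ^ mask)
    return ball


def fenshses_body(q_subcodes, r: int, m: int, sub_len: int) -> dict:
    """Build the r-neighbor request for query sub-codes q_subcodes (integers)."""
    s = len(q_subcodes)
    # Filter context: union of the sub-code Hamming balls of radius r // s.
    should = [
        {"terms": {f"b{k}": hamming_ball(qk, sub_len, r // s)}}
        for k, qk in enumerate(q_subcodes, start=1)
    ]
    # Scoring: each function subtracts one sub-code Hamming distance.
    functions = [
        {
            "script_score": {
                "script": {
                    "id": "hmd64bit",
                    "params": {"field": f"b{k}", "subcode": qk},
                }
            },
            "weight": -1,
        }
        for k, qk in enumerate(q_subcodes, start=1)
    ]
    return {
        "min_score": m - r,  # score = m - Hamming distance, so keep distance <= r
        "query": {
            "function_score": {
                "query": {
                    "constant_score": {
                        "boost": m,
                        "filter": {"bool": {"should": should}},
                    }
                },
                "functions": functions,
                "boost_mode": "sum",
                "score_mode": "sum",
            }
        },
    }


body = fenshses_body([0x00FF, 0x0F0F], r=3, m=32, sub_len=16)
print(json.dumps(body)[:120])
```

Additional term or range filters for multi-modal search would sit alongside the `should` clauses inside the same `bool` filter. Note that sub-codes occupying all 64 bits would need to be expressed as signed values to fit an Elasticsearch long field; the example sidesteps this by using short sub-codes.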

4 Experiment

We compare search latencies between the term match approach and FENSHSES with semantic binary codes generated from Jet.com’s catalog images. To better understand the contribution of each technique involved in FENSHSES, we experiment systematically with four methods: the term match baseline, FENSHSES with just bit operation, FENSHSES without data preprocessing and FENSHSES.

Settings.

Our dataset is generated from half a million images selected from Jet.com's furniture catalog through the SSDH model [11], which leverages a deep CNN (convolutional neural network) to output semantic binary codes in an end-to-end manner. (Footnote 5: Note that the purpose of the experiment is not to compare different embedding models, but to evaluate the performance of FENSHSES, which should be generally applicable to NNS in any Hamming space.) We choose the length of binary codes to be $m = 128$ and $m = 256$, respectively. For the setting of FENSHSES, we keep the sub-code length as 64 for bit operation and 16 for sub-code filtering throughout the experiment, since we observe that such segmentations consistently yield satisfactory performance. Each Elasticsearch index is created with five shards and zero replicas on a single-node Elasticsearch cluster deployed on a Microsoft Azure virtual machine with 12 cores and 112 GiB of RAM.

Evaluation.

We randomly select 1,000 binary codes from $\mathcal{B}$ to act as query codes $\mathbf{q}$. For each $\mathbf{q}$, we compare the search latencies among all four methods with Hamming distance $r \in \{5, 10, 15, 20\}$.

| $m$ | $r$ | Term match | Bit operation | FENSHSES w/o prep. | FENSHSES |
|-----|-----|------------|---------------|--------------------|----------|
| 128 | 5 | 641.99 (19.01) | 41.38 (6.38) | 2.80 (3.50) | 1.08 (1.25) |
| 128 | 10 | 638.20 (16.65) | 42.24 (7.39) | 7.40 (5.07) | 3.62 (1.54) |
| 128 | 15 | 637.63 (16.14) | 43.08 (7.90) | 7.19 (5.09) | 3.45 (1.55) |
| 128 | 20 | 638.41 (17.41) | 42.65 (7.59) | 15.51 (5.88) | 9.51 (2.18) |
| 256 | 5 | 1259.22 (30.66) | 75.35 (11.87) | 6.24 (6.48) | 2.18 (2.02) |
| 256 | 10 | 1257.04 (20.68) | 75.06 (11.27) | 6.28 (6.63) | 2.13 (1.97) |
| 256 | 15 | 1270.38 (25.88) | 75.81 (12.22) | 6.70 (6.93) | 2.09 (1.56) |
| 256 | 20 | 1278.47 (25.56) | 75.50 (11.94) | 18.02 (10.71) | 7.67 (2.85) |

Table 1: Means and standard deviations (in brackets) of search latency (measured in ms) under different scenarios. FENSHSES is dramatically faster than the term match approach, and all of the three techniques involved in FENSHSES contribute substantially to this performance improvement.
Figure 2: Experimental results for 128-bit binary codes.
Figure 3: Experimental results for 256-bit binary codes.

Results.

As shown in Table 1, FENSHSES is much faster than the term match approach. To better visualize these speed-ups, we plot average search latencies under logarithmic transformation in Fig. 2 and 3. In the following, we address the contribution of each component of FENSHSES respectively.

  • By computing the Hamming distance using bit operation instead of term match, we consistently observe a speed-up of around sixteen times across different $m$ and $r$.

  • The amount of speed-up introduced by sub-code filtering varies with the radius $r$. Specifically, as $\lfloor r/s \rfloor$ heavily influences the number of data points to be considered for Hamming distance computation (see (5)), for $r$'s with the same value of $\lfloor r/s \rfloor$, the search latencies of FENSHSES w/o prep. are quite similar. As $r$ becomes larger, the sub-code filtering technique becomes less effective. In practice, since we most likely care about nearest neighbors within a small radius, the sub-code filtering technique should be capable of greatly reducing the search latency.

  • By reshuffling binary codes to reduce their correlations within each sub-code group, the technique of data preprocessing with permutation not only accelerates FENSHSES in terms of the average search latency, but also stabilizes its overall performance with a much smaller standard deviation.

  • A comprehensive comparison between FENSHSES and FAISS [7] in terms of indexing speed, search latency and RAM consumption is also conducted in [13], where FENSHSES demonstrates competitive performance.
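The arithmetic behind these observations can be reproduced from the mean latencies in Table 1. The short Python sketch below computes the speed-up of bit operation over term match, the overall speed-up of FENSHSES, and the sub-code ball radius for each setting. The filtering sub-code length of 16 bits is treated here as an assumption (it is consistent with the way latencies cluster in the table); the constant and function names are hypothetical.

```python
# Mean latencies in ms copied from Table 1:
# (m, r) -> (term match, bit operation, FENSHSES w/o prep., FENSHSES)
TABLE_1 = {
    (128, 5): (641.99, 41.38, 2.80, 1.08),
    (128, 10): (638.20, 42.24, 7.40, 3.62),
    (128, 15): (637.63, 43.08, 7.19, 3.45),
    (128, 20): (638.41, 42.65, 15.51, 9.51),
    (256, 5): (1259.22, 75.35, 6.24, 2.18),
    (256, 10): (1257.04, 75.06, 6.28, 2.13),
    (256, 15): (1270.38, 75.81, 6.70, 2.09),
    (256, 20): (1278.47, 75.50, 18.02, 7.67),
}
FILTER_BITS = 16  # assumed sub-code length used for filtering


def summarize(table, filter_bits=FILTER_BITS):
    """Per-setting ball radius and speed-ups relative to term match."""
    rows = []
    for (m, r), (term, bit_op, no_prep, full) in sorted(table.items()):
        s = m // filter_bits  # number of filtering sub-codes
        rows.append({
            "m": m,
            "r": r,
            "ball_radius": r // s,
            "bit_op_speedup": term / bit_op,
            "total_speedup": term / full,
        })
    return rows


for row in summarize(TABLE_1):
    print(f"m={row['m']:3d} r={row['r']:2d} floor(r/s)={row['ball_radius']} "
          f"bit-op x{row['bit_op_speedup']:.1f} total x{row['total_speedup']:.0f}")
```

Under this assumption, settings that share the same ball radius (for example r = 10 and r = 15 at 128 bits, or r = 5, 10 and 15 at 256 bits) are exactly the ones whose filtered latencies are close to each other in Table 1.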

5 Conclusion

It has been recently demonstrated that NNS systems built upon full-text search engines are capable of effectively reducing main memory consumption, coherently supporting multi-modal search and being well-prepared for production deployment. Motivated by these clear advantages, in this paper, we explore how to empower full-text search engines to efficiently find nearest neighbors in Hamming space. By revisiting bit operation, sub-code filtering and data preprocessing with permutation, we propose a novel approach to accomplish this task, which is shown empirically to be substantially faster than the term match approach (the state-of-the-art one for today's full-text search engines to find nearest neighbors within binary codes). By implementing the proposed approach on the Elasticsearch platform, we deliver a cutting-edge engineering solution called FENSHSES. In the future, we will also explore how to implement our approach efficiently on other full-text search engines (e.g., Solr and Sphinx).

Acknowledgment

We are grateful to three anonymous reviewers for their helpful suggestions and comments, which substantially improved the paper. We would also like to thank Aliasgar Kutiyanawala for helping us fix a bug in an earlier version of JSON 3.1, and Eliot P. Brenner for bringing the work [19] to our attention.

References

  • [1] Amato, G., Bolettieri, P., Carrara, F., Falchi, F., Gennaro, C.: Large-scale image retrieval with elasticsearch. In: SIGIR (2018)
  • [2] Andreev, K., Racke, H.: Balanced graph partitioning. Theory of Computing Systems 39(6), 929–939 (2006)
  • [3] B., E.: Annoy: Approximate Nearest Neighbors in C++/Python (2018), https://pypi.org/project/annoy/, python package version 1.13.0
  • [4] Bader, D.A., Meyerhenke, H., Sanders, P., Wagner, D.: Graph partitioning and graph clustering, vol. 588. American Mathematical Soc. (2013)
  • [5] Beeler, M., Gosper, R.W., Schroeppel, R.: Hakmem. MIT Artificial Intelligence Laboratory (1972)
  • [6] Chen, Q., Wang, H., Li, M., Ren, G., Li, S., Zhu, J., Li, J., Liu, C., Zhang, L., Wang, J.: SPTAG: A library for fast approximate nearest neighbor search (2018), https://github.com/Microsoft/SPTAG
  • [7] Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017)
  • [8] Kernighan, B., Lin, S.: An efficient heuristic procedure for partitioning graphs. The Bell system technical journal 49(2), 291–307 (1970)
  • [9] Krauthgamer, R., Naor, J., Schwartz, R.: Partitioning graphs into balanced components. In: SODA (2009)
  • [10] Li, G., Deng, D., Wang, J., Feng, J.: Pass-join: A partition-based method for similarity joins. In: VLDB (2012)
  • [11] Lin, K., Yang, H.F., Hsiao, J.H., Chen, C.S.: Deep learning of binary hash codes for fast image retrieval. In: CVPR DeepVision Workshop (2015)
  • [12] Lux, M., Marques, O.: Visual Information Retrieval Using Java and LIRE, vol. 25. Morgan & Claypool Publishers (2013)
  • [13] Mu, C., Yang, B., Yan, Z.: An empirical comparison of faiss and fenshses for nearest neighbor search in hamming space. In: SIGIR eCommerce Workshop (2019)
  • [14] Mu, C., Zhao, J., Yang, G., Zhang, J., Yan, Z.: Towards practical visual search engine within elasticsearch. In: SIGIR eCommerce Workshop (2018)
  • [15] Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2227–2240 (2014)
  • [16] Norouzi, M., Punjani, A., Fleet, D.J.: Fast search in hamming space with multi-index hashing. In: CVPR (2012)
  • [17] Qin, J., Wang, Y., Xiao, C., Wang, W., Lin, X., Ishikawa, Y.: GPH: Similarity search in hamming space. In: ICDE (2018)
  • [18] Ruzicka, M., Novotny, V., Sojka, P., Pomikalek, J., Rehurek, R.: Flexible similarity search of semantic vectors using fulltext search engines. http://ceur-ws.org/Vol-1923/article-01.pdf (2018)
  • [19] Rygl, J., Pomikalek, J., Rehurek, R., Ruzicka, M., Novotny, V., Sojka, P.: Semantic vector encoding and similarity search using fulltext search engines. In: RepL4NLP Workshop (2017)
  • [20] Yang, F., Kale, A., Bubnov, Y., Stein, L., Wang, Q., Kiapour, H., Piramuthu, R.: Visual search at ebay. In: KDD (2017)