Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines
Abstract
A growing interest has been witnessed recently from both academia and industry in building nearest neighbor search (NNS) solutions on top of full-text search engines. Compared with other NNS systems, such solutions are capable of effectively reducing main memory consumption, coherently supporting multi-modal search and being immediately ready for production deployment. In this paper, we continue the journey to explore specifically how to empower full-text search engines with fast and exact NNS in Hamming space (i.e., the set of binary codes). By revisiting three techniques (bit operation, subcode filtering and data preprocessing with permutation) from the information retrieval literature, we develop a novel engineering solution for full-text search engines to efficiently accomplish this special but important NNS task. In the experiment, we show that our proposed approach enables full-text search engines to achieve significant speedups over the state-of-the-art term match approach for NNS within binary codes.
Keywords:
Full-text search engine · Nearest neighbor search · Hamming space · Semantic binary embedding · Elasticsearch · Lucene

1 Introduction
Full-text search engines, based on first-order document-term statistics such as TF-IDF and Okapi BM25, have been deployed ubiquitously in today's web applications to help customers find textual documents that match their specified keywords.
Recently, active efforts from both academia and industry [12, 19, 18, 1, 14] have been made to empower full-text search engines with the capability of nearest neighbor search (NNS). Compared with other NNS solutions (e.g., Annoy [3], FLANN [15], FAISS [7] and SPTAG [6]), such full-text search engine based ones have a number of clear advantages.
Implemented in secondary memory.
As demonstrated by Amato et al. (2018), unlike other NNS solutions implemented in main memory, NNS systems established on full-text search engines benefit from the highly optimized disk-based index mechanics behind these engines and thus substantially reduce main-memory consumption. This makes such systems more cost-effective and therefore more suitable for big-data applications.
Flexible in multi-modal search.
As highlighted by Mu et al. (2018), enabling full-text search engines with NNS paves a coherent way towards multi-modal search, e.g., allowing users to express their interests through both visual and textual queries (see Fig. 1 for an application of multi-modal search in e-commerce), at which most other NNS systems fall short.
Ready for production.
Last but not least, as emphasized by Rygl et al. (2017), NNS systems built upon full-text search engines are extremely well-prepared for production deployment. Thanks to the cutting-edge engineering behind full-text search engines (e.g., Elasticsearch and Solr), important features like horizontal scaling, I/O and cache optimization, security configuration, index and cluster management, real-time monitoring and RESTful APIs are immediately available to such NNS systems, so that engineers can effectively avoid reinventing the wheel.
Blessed by all the above distinct benefits, we continue this journey by exploring specifically effective ways to achieve fast and exact nearest neighbor search in Hamming space (i.e., the set of binary codes) on top of full-text search engines.
Problem statement.
Specifically, given the following dataset of $n$ binary codes of length $m$,

(1)  $\mathcal{B} = \{b^1, b^2, \ldots, b^n\} \subseteq \{0, 1\}^m$,

the goal of our paper is to enable full-text search engines with the capability of efficiently finding all $r$-neighbors of a query code $q \in \{0,1\}^m$ in $\mathcal{B}$, namely

(2)  $\mathcal{N}(q, r) := \{ b \in \mathcal{B} \mid d_H(q, b) \le r \}$,

where $d_H(q, b) := |\{ i \mid q_i \neq b_i \}|$ denotes the Hamming distance between binary codes $q$ and $b$.^1

^1 It is worth noting that the $r$-neighbor search problem studied by the paper can also be easily adapted to conduct $k$-NN ($k$ nearest neighbors) search by progressively increasing the Hamming search radius $r$ until $k$ neighbors are found.
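To make the search semantics concrete, here is a minimal Python sketch (ours, for illustration only) that spells out $\mathcal{N}(q, r)$ as an exhaustive linear scan, with binary codes represented as unsigned integers; the approaches discussed in this paper return exactly this result, only faster.

from typing import List

def hamming_distance(q: int, b: int) -> int:
    """d_H(q, b): the number of bit positions where q and b differ."""
    return bin(q ^ b).count("1")

def r_neighbors(q: int, codes: List[int], r: int) -> List[int]:
    """Exact linear-scan baseline for N(q, r) = {b in B : d_H(q, b) <= r}."""
    return [b for b in codes if hamming_distance(q, b) <= r]

# Example: 8-bit codes, radius 2.
B = [0b10110100, 0b10110111, 0b01001011]
print(r_neighbors(0b10110110, B, r=2))  # -> [180, 183] (the first two codes)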
Why binary codes?
Finding nearest neighbors in Hamming space is an extremely important subclass of NNS, as learning and representing textual or visual data with compact and semantic binary vectors is a mature technology and common practice in modern information retrieval. Using well-trained binary vectors instead of floating-point ones enables dramatic reductions in storage and communication costs without sacrificing much search accuracy. In particular, eBay builds its whole visual search system [20] by finding nearest neighbors within binary codes extracted from catalog and query images through the supervised semantics-preserving deep hashing (SSDH) model [11].
What is missing?
However, most of the aforementioned full-text search engine based NNS solutions [19, 18, 1, 14] fail to deal with binary codes, as they find nearest neighbors by generating and indexing surrogate textual tokens based merely on information collected from the several largest-magnitude entries of each floating-point vector. The only exception is the term match approach developed by Lux and Marques (2013) in their Java library LIRE (Lucene Image Retrieval). Its core idea is to calculate the Hamming distance between two binary codes by matching their bits at each position. This term match approach is, on the one hand, a natural way to leverage full-text search engines for NNS; on the other hand, it heavily overlooks the intrinsic and special properties of binary codes by treating binary digits simply as text. Motivated by this, we revisit three pervasive techniques from the information retrieval literature: bit operation, subcode filtering and data preprocessing with permutation. By integrating these three techniques seamlessly into full-text search engines, we end up with a solution that outperforms the term match one dramatically in terms of search latency.
Elasticsearch.
Elasticsearch (ES), built upon Apache Lucene, is an open-source, real-time, distributed and multi-tenant full-text search engine. Since its first release in February 2010, it has become the most popular enterprise search engine, widely adopted by a variety of companies (e.g., eBay, Facebook, GitHub, Lyft and Shopify) for either internal or external use to discover relevant documents. Similar to previous works [19, 18, 1, 14], and without loss of general applicability to other full-text search engines, we elaborate our core ideas concretely on the Elasticsearch platform. First-hand experiences in implementing this ES-based solution are extensively addressed, which should be greatly valuable for practitioners wishing to understand and replicate our approach.
Organization.
The rest of the paper is organized as follows. In Section 2, we first review the term match approach widely implemented on full-text search engines to find nearest neighbors among binary codes. In Section 3, we propose a better way for full-text search engines to accomplish this task. Specifically, we implement an Elasticsearch-based solution called FENSHSES (Fast Exact Neighbor Search in Hamming Space on Elasticsearch) to conduct nearest neighbor search in Hamming space. FENSHSES incorporates three techniques: bit operation, which enables Elasticsearch to compute Hamming distance with just a few bit operations; subcode filtering, which instructs Elasticsearch to conduct a simple but effective screening process before any Hamming distance calculation and thereby empowers FENSHSES with sub-linear search times; and data preprocessing with permutation, which permutes binary codes appropriately to maximize the effect of subcode filtering. In Section 4, we show that FENSHSES outperforms the term match approach dramatically in terms of search latency.
2 Term Match from LIRE
Based on its definition, the Hamming distance is nothing but the number of positions at which two binary codes differ. As a result, full-text search engines can naturally compute it through term match. Specifically, for each binary code $b \in \mathcal{B}$, we index its positions corresponding to ones and zeros respectively, i.e.,

$\mathcal{I}_b = \{ i \mid b_i = 1 \}$ and $\mathcal{O}_b = \{ i \mid b_i = 0 \}$,

with $\mathcal{I}_b \cup \mathcal{O}_b = \{1, 2, \ldots, m\}$. When the query binary code $q$ arrives, Elasticsearch can simply calculate its Hamming distance to each binary code $b$ by matching their zero and one positions:

(3)  $d_H(q, b) = m - |\mathcal{I}_q \cap \mathcal{I}_b| - |\mathcal{O}_q \cap \mathcal{O}_b|$.

The JSON-encoded request body for Elasticsearch to find $\mathcal{N}(q, r)$ by computing (3) is illustrated in JSON 2,^2 where we denote $\mathcal{I}_q = \{i_q[1], \ldots, i_q[u]\}$ and $\mathcal{O}_q = \{o_q[1], \ldots, o_q[v]\}$. Since each matched position contributes one to the score, the score of $b$ equals $m - d_H(q, b)$ by (3); setting min_score to $m - r$ therefore retrieves exactly $\mathcal{N}(q, r)$.

^2 All ES-related implementations are based on Elasticsearch 6.1.0.
This term match approach, first introduced in the Java library LIRE [12] to find visually similar images (based on their binary visual features), is currently the state-of-the-art approach for full-text search engines to find nearest neighbors within binary codes. Some of its variants (e.g., a fuzzy query based on Levenshtein edit distance) are also widely used on full-text search engines nowadays.
JSON 2: JSON-encoded request body of the term match approach (m - r, i_q[.] and o_q[.] are placeholders to be instantiated per query).

{
  "min_score": m - r,
  "query": {
    "function_score": {
      "functions": [
        {"filter": {"term": {"Ib": i_q[1]}}, "weight": 1},
        ...,
        {"filter": {"term": {"Ib": i_q[u]}}, "weight": 1},
        {"filter": {"term": {"Ob": o_q[1]}}, "weight": 1},
        ...,
        {"filter": {"term": {"Ob": o_q[v]}}, "weight": 1}
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}
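To see concretely why (3) holds, and hence why min_score equals $m - r$, the following Python sketch (ours, for illustration) cross-checks the term match score against a direct XOR-based Hamming distance:

import random

m = 16  # code length

def positions(code: int):
    """Index sets used by term match: ones-positions I and zeros-positions O."""
    ones = {i for i in range(m) if (code >> i) & 1}
    return ones, set(range(m)) - ones

random.seed(0)
q, b = random.getrandbits(m), random.getrandbits(m)
Iq, Oq = positions(q)
Ib, Ob = positions(b)

score = len(Iq & Ib) + len(Oq & Ob)   # what the function_score query sums up
d = bin(q ^ b).count("1")             # true Hamming distance
assert d == m - score                 # equation (3)
# Hence d_H(q, b) <= r  <=>  score >= m - r, i.e. "min_score": m - r.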
3 Proposed Approach: FENSHSES
The term match approach treats each binary digit (i.e., bit) in a textual way, which heavily overlooks the intrinsic and special properties of binary codes. By making better use of these properties, we introduce a novel approach called FENSHSES (Fast Exact Neighbor Search in Hamming Space on Elasticsearch), whose complete JSON-encoded ES request body can be found in JSON 3.4(b). In essence, FENSHSES integrates three techniques: bit operation, subcode filtering and data preprocessing with permutation, all of which should be generally applicable to other full-text search engines besides Elasticsearch. These three techniques are pervasively used in nearest neighbor search for binary codes; but as far as we know, this is the first time they have been seamlessly integrated into full-text search engines, leading to a novel NNS solution with minimal main memory consumption, full support of multi-modal search and extreme readiness for production deployment (per our discussions in Section 1).
3.1 Bit Operation
Motivated by the well-known fact that Hamming distances between binary codes can be computed extremely fast using bit operations, in this part we explore how to replace term match by natively empowering Elasticsearch to calculate Hamming distances through bit operations.
For an $m$-bit binary code $b$, we first segment it into $s$ subcodes:^3

(4)  $b = \left( b^{(1)}, b^{(2)}, \ldots, b^{(s)} \right)$, where each subcode $b^{(j)} \in \{0,1\}^{m/s}$.

Since $d_H(q, b) = \sum_{j=1}^{s} d_H(q^{(j)}, b^{(j)})$, the Hamming distance calculation is reduced to $s$ calculations over binary codes of much shorter length $m/s$. We reimplement the assembly code found in the notable HAKMEM memo [5], which computes the Hamming distance between two binary codes of length 64 or less, in Painless, a simple and secure scripting language designed specifically for Elasticsearch (see the stored script hmd64bit in JSON 3.1). When the query binary code $q$ is issued, we invoke hmd64bit $s$ times to calculate $\{d_H(q^{(j)}, b^{(j)})\}_{j=1}^{s}$, by specifying each subcode $q^{(j)}$ and the field storing $b^{(j)}$ as parameters, and then sum the results up. The whole process can be efficiently implemented in ES using the function score query, where several functions are combined to calculate the score of each document (see the functions array in JSON 3.4(b)).

^3 For simplicity, we assume that $s$ divides $m$.
JSON 3.1: The Painless script hmd64bit, which computes the Hamming distance between two 64-bit subcodes via HAKMEM-style bit counting.

POST _scripts/hmd64bit
{
  "script": {
    "lang": "painless",
    "source": """
      long u = params.subcode^doc[params.field].value;
      long uCount = u-((u&7905747460161236407L)>>>1)-((u&5270498306774157605L)>>>2);
      return ((uCount+(uCount>>>3))&8198552921648689607L)%63;
    """
  }
}
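The decomposition used by FENSHSES can be mirrored in a few lines of Python (our sketch, with 64-bit words): segment the code, XOR word-wise and sum the popcounts, which is exactly what the $s$ chained hmd64bit calls compute on the Elasticsearch side.

def to_words(code: int, m: int, w: int = 64):
    """Segment an m-bit code into m/w unsigned w-bit subcodes (assumes w | m)."""
    mask = (1 << w) - 1
    return [(code >> (w * j)) & mask for j in range(m // w)]

def hamming_via_subcodes(q: int, b: int, m: int) -> int:
    """d_H(q, b) as the sum of subcode distances, following equation (4)."""
    return sum(bin(qj ^ bj).count("1")
               for qj, bj in zip(to_words(q, m), to_words(b, m)))

q, b = (1 << 100) | 0b1011, 0b0011
assert hamming_via_subcodes(q, b, m=128) == bin(q ^ b).count("1")  # == 2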
3.2 Subcode Filtering
So far, with either the term match approach or the bit operation one, we still have to exhaustively compute the Hamming distance between $q$ and each binary code in $\mathcal{B}$. This expensive linear scan is not desirable for many applications where the number of codes in $\mathcal{B}$ is on the order of millions or even billions [20]. As a remedy, in this part we borrow a simple but powerful counting argument from Norouzi et al. (2012) to conduct a screening process before any Hamming distance calculation, which successfully empowers our FENSHSES approach with sub-linear search times.
Suppose binary codes are segmented into $s$ subcodes as in (4). Then for two codes $q$ and $b$ within Hamming distance $r$, among all their subcode pairs $\{(q^{(j)}, b^{(j)})\}_{j=1}^{s}$, there must be at least one pair with Hamming distance no larger than $\lfloor r/s \rfloor$ (otherwise the total distance would exceed $r$), which mathematically implies

(5)  $\mathcal{N}(q, r) \subseteq \bigcup_{j=1}^{s} \left\{ b \in \mathcal{B} \mid d_H(q^{(j)}, b^{(j)}) \le \lfloor r/s \rfloor \right\}$.

This simple counting argument yields great potential in reducing the number of Hamming distance calculations needed to find all $r$-neighbors in $\mathcal{B}$. Specifically, according to relationship (5), it is safe to consider only the binary codes belonging to the set on the right-hand side of (5), whose size could be substantially smaller than $n$ for small search radius $r$. It is worth noting that similar ideas have been frequently revisited in many different contexts, e.g., multi-index hashing [16] and string similarity joins [10], and a generalized version of (5) has also been derived recently [17].
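The following small Python simulation (ours) illustrates the screening logic of (5): a code survives the filter only if at least one of its subcodes lies within $\lfloor r/s \rfloor$ of the query's corresponding subcode, and the assertion confirms that no true $r$-neighbor is ever discarded.

import random

def subcodes(code, s, width):
    mask = (1 << width) - 1
    return [(code >> (width * j)) & mask for j in range(s)]

def candidates(q, codes, r, s, width):
    """Right-hand side of (5): the union of s subcode Hamming balls."""
    t = r // s  # floor(r/s)
    qs = subcodes(q, s, width)
    return [b for b in codes
            if any(bin(qj ^ bj).count("1") <= t
                   for qj, bj in zip(qs, subcodes(b, s, width)))]

random.seed(1)
m, s, r = 64, 4, 6
codes = [random.getrandbits(m) for _ in range(10000)]
q = random.getrandbits(m)
cand = candidates(q, codes, r, s, width=m // s)
exact = [b for b in codes if bin(q ^ b).count("1") <= r]
assert set(exact) <= set(cand)   # no r-neighbor is filtered out
print(len(cand), "of", len(codes), "codes need an exact distance check")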
Due to the inverted-index nature of full-text search engines, this subcode filtering step is extremely suitable and straightforward to implement on them. Specifically, on Elasticsearch, we can simply leverage the filter context (see the constant_score filter in JSON 3.4(b)), within which each subcode Hamming ball $\{b \in \mathcal{B} \mid d_H(q^{(j)}, b^{(j)}) \le \lfloor r/s \rfloor\}$ is obtained by a terms query enumerating all subcodes within Hamming distance $\lfloor r/s \rfloor$ of $q^{(j)}$, and the union is achieved through a boolean combination of should clauses.
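Since each terms query must enumerate its subcode Hamming ball explicitly, a candidate-generation sketch (ours; the helper names hamming_ball and subcode_filter are hypothetical, and the field names b1, ..., bs follow JSON 3.4(a)) looks as follows:

from itertools import combinations

def hamming_ball(subcode: int, width: int, t: int):
    """All width-bit integers within Hamming distance t of subcode."""
    ball = []
    for k in range(t + 1):
        for flips in combinations(range(width), k):
            x = subcode
            for i in flips:
                x ^= 1 << i  # flip the chosen bit positions
            ball.append(x)
    return ball

def subcode_filter(q_subcodes, width: int, t: int):
    """bool/should clause implementing the union on the right of (5)."""
    return {"bool": {"should": [
        {"terms": {f"b{j + 1}": hamming_ball(qj, width, t)}}
        for j, qj in enumerate(q_subcodes)
    ]}}

# e.g. two 16-bit subcodes, radius floor(r/s) = 1: 17 terms per subcode
print(subcode_filter([0x00FF, 0xABCD], width=16, t=1))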
3.3 Data Preprocessing with Permutation
The effectiveness of subcode filtering is maximized when the bits within the same subcode group are statistically independent. Since the Hamming distance is invariant under permutation transformations, it is tempting to transform the binary codes in $\mathcal{B}$ with an appropriate permutation to approach this desired group independence property.
Two Bernoulli random variables $z_i$ and $z_j$ are independent if and only if their correlation coefficient $\rho_{ij} = 0$. Therefore, it is natural to seek a permutation $\pi$ that minimizes correlation effects within each subcode segment. This immediately leads to the following optimization problem, proposed by Wan et al. (2013) to improve the performance of multi-index hashing [16]:

(6)  $\min_{\pi} \ \left\langle P_\pi^\top R P_\pi, \, A \right\rangle$.

Here $A$ is the block diagonal matrix $A = \mathrm{diag}(J, J, \ldots, J)$, with $J$ as an $(m/s) \times (m/s)$ matrix of ones; $P_\pi$ is the permutation matrix induced by $\pi$, i.e.,

$(P_\pi)_{ij} = 1$ if $\pi(i) = j$ and $(P_\pi)_{ij} = 0$ otherwise;

and $R$ is a matrix in $\mathbb{R}^{m \times m}$ whose $(i, j)$ entry is obtained from $\mathcal{B}$ as the absolute value of the correlation between the $i$-th and the $j$-th bits.

Problem (6) is essentially an instance of the extensively studied balanced graph partitioning problem [8, 2, 9, 4]. In FENSHSES, we solve problem (6) with the well-known and scalable Kernighan-Lin algorithm [8], the gist of which is to iteratively find appropriate position pairs $(i, j)$ and swap their mappings $\pi(i)$ and $\pi(j)$ whenever doing so decreases the objective of (6).
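To make the swap heuristic concrete, the following Python sketch (ours) implements a simplified single-sweep variant of Kernighan-Lin, rather than the full algorithm with tentative swap sequences: it keeps any pairwise swap of bit positions that lowers the within-group sum of absolute correlations in (6).

import numpy as np

def within_group_cost(R: np.ndarray, perm: np.ndarray, group_size: int) -> float:
    """Objective of (6): sum of |correlations| between bits placed in the same group."""
    cost = 0.0
    for g in range(len(perm) // group_size):
        idx = perm[g * group_size:(g + 1) * group_size]
        cost += R[np.ix_(idx, idx)].sum()
    return cost

def greedy_swap_permutation(R: np.ndarray, group_size: int, sweeps: int = 5):
    """Kernighan-Lin-style local search: keep any pairwise swap that helps."""
    m = R.shape[0]
    perm = np.arange(m)
    best = within_group_cost(R, perm, group_size)
    for _ in range(sweeps):
        improved = False
        for i in range(m):
            for j in range(i + 1, m):
                perm[i], perm[j] = perm[j], perm[i]
                cost = within_group_cost(R, perm, group_size)
                if cost < best:
                    best, improved = cost, True
                else:
                    perm[i], perm[j] = perm[j], perm[i]  # undo the swap
        if not improved:
            break
    return perm

# R: absolute bit-correlation matrix estimated from the binary codes in B.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(1000, 16)).astype(float)
R = np.abs(np.corrcoef(bits, rowvar=False))
print(greedy_swap_permutation(R, group_size=4))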
We leave it as future work to solve (6) with more recently developed approximation algorithms enjoying better theoretical guarantees, e.g., the one based on semidefinite relaxation [9].
3.4 Elasticsearch Index for FENSHSES
For the proposed FENSHSES approach, the data to be indexed into the Elasticsearch cluster is quite minimal. For each binary code $b \in \mathcal{B}$, we only need to index its subcodes $b^{(1)}, \ldots, b^{(s)}$.^4 In JSON 3.4(a), we demonstrate how to index these subcodes together with other fields into one Elasticsearch index. Here we assume that bit operation and subcode filtering segment binary codes in the same manner. If their segmentations differ, two sets of subcodes can be indexed into the same Elasticsearch index, e.g., by treating each set as a nested datatype.

^4 More accurately, what is indexed are the integers represented by the binary subcodes.
Suppose the binary codes in $\mathcal{B}$ are semantic visual embeddings generated from product images. With other product information (e.g., title, brand and price) indexed together with the subcodes into one Elasticsearch index, we can easily achieve the multi-modal search depicted in Figure 1 by adding filters (e.g., a term filter and a range filter) to JSON 3.4(b).
JSON 3.4(a): Elasticsearch index mapping for FENSHSES.

PUT /fenshses
PUT /fenshses/default/_mapping
{
  "properties": {
    "title": {"type": "text"},
    "brand": {"type": "keyword"},
    "price": {"type": "double"},
    "is_in_stock": {"type": "boolean"},
    ...,
    "b1": {"type": "long"},
    ...,
    "bs": {"type": "long"}
  }
}
JSON 3.4(b): JSON-encoded request body of FENSHSES, combining subcode filtering (the filter context) with bit-operation scoring (the functions array).

{
  "min_score": m - r,
  "query": {
    "function_score": {
      "query": {
        "constant_score": {
          "boost": m,
          "filter": {
            "bool": {
              "should": [
                {"terms": {"b1": [floor(r/s)-neighbors of q1]}},
                ...,
                {"terms": {"bs": [floor(r/s)-neighbors of qs]}}
              ]
            }
          }
        }
      },
      "functions": [
        {"script_score": {"script": {"id": "hmd64bit", "params": {"field": "b1", "subcode": q1}}}, "weight": -1},
        ...,
        {"script_score": {"script": {"id": "hmd64bit", "params": {"field": "bs", "subcode": qs}}}, "weight": -1}
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}

(With weight -1 on each subcode distance, the final score equals m - d_H(q, b), so "min_score": m - r retains exactly the codes with d_H(q, b) <= r.)
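For practitioners, this request body can be assembled programmatically. Below is a minimal sketch (ours, not from the paper); it reuses the hypothetical hamming_ball helper sketched in Section 3.2, and the field names b1, ..., bs match the mapping in JSON 3.4(a):

def fenshses_body(q_subcodes, m, r, ball):
    """Assemble the FENSHSES request body for the query's subcodes.

    ball(qj) must return the floor(r/s)-neighborhood of subcode qj,
    e.g. via the hamming_ball helper sketched in Section 3.2.
    """
    return {
        # final score is m - d_H(q, b); keep codes with d_H(q, b) <= r
        "min_score": m - r,
        "query": {"function_score": {
            "query": {"constant_score": {
                "boost": m,
                "filter": {"bool": {"should": [
                    {"terms": {f"b{j + 1}": ball(qj)}}
                    for j, qj in enumerate(q_subcodes)
                ]}}
            }},
            "functions": [
                {"script_score": {"script": {
                    "id": "hmd64bit",
                    "params": {"field": f"b{j + 1}", "subcode": qj}
                }}, "weight": -1}
                for j, qj in enumerate(q_subcodes)
            ],
            "score_mode": "sum",
            "boost_mode": "sum"
        }}
    }

The resulting dictionary can then be posted to the index's _search endpoint, e.g., with the official Elasticsearch Python client or curl.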
4 Experiment
We compare search latencies between the term match approach and FENSHSES using semantic binary codes generated from Jet.com's catalog images. To better understand the contribution of each technique involved in FENSHSES, we experiment systematically with four methods: the term match baseline, FENSHSES with bit operation only, FENSHSES without data preprocessing (i.e., bit operation plus subcode filtering) and the full FENSHSES.
Settings.
Our dataset is generated from half a million images selected from Jet.com's furniture catalog through the SSDH model [11], which leverages a deep convolutional neural network (CNN) to output semantic binary codes in an end-to-end manner.^5 We choose the length of binary codes to be $m = 128$ and $m = 256$, respectively. For the setting of FENSHSES, we keep the subcode length at 64 for bit operation and at 16 for subcode filtering throughout the experiment, since we observe that such segmentations consistently yield satisfactory performance. Each Elasticsearch index is created with five shards and zero replicas on a single-node Elasticsearch cluster deployed on a Microsoft Azure virtual machine with 12 cores and 112 GiB of RAM.

^5 Note that the purpose of the experiment is not to compare different embedding models, but to evaluate the performance of FENSHSES, which should be generally applicable to NNS in any Hamming space.
Evaluation.
We randomly select 1,000 binary codes from $\mathcal{B}$ to act as query codes. For each query code $q$, we compare the search latencies among all four methods with Hamming search radius $r \in \{5, 10, 15, 20\}$.
Table 1: Search latencies in milliseconds, reported as mean (standard deviation) over 1,000 queries, for code length m and search radius r.

 m   | r  | Term match      | Bit operation  | FENSHSES w/o prep. | FENSHSES
-----|----|-----------------|----------------|--------------------|------------
 128 | 5  | 641.99 (19.01)  | 41.38 (6.38)   | 2.80 (3.50)        | 1.08 (1.25)
 128 | 10 | 638.20 (16.65)  | 42.24 (7.39)   | 7.40 (5.07)        | 3.62 (1.54)
 128 | 15 | 637.63 (16.14)  | 43.08 (7.90)   | 7.19 (5.09)        | 3.45 (1.55)
 128 | 20 | 638.41 (17.41)  | 42.65 (7.59)   | 15.51 (5.88)       | 9.51 (2.18)
 256 | 5  | 1259.22 (30.66) | 75.35 (11.87)  | 6.24 (6.48)        | 2.18 (2.02)
 256 | 10 | 1257.04 (20.68) | 75.06 (11.27)  | 6.28 (6.63)        | 2.13 (1.97)
 256 | 15 | 1270.38 (25.88) | 75.81 (12.22)  | 6.70 (6.93)        | 2.09 (1.56)
 256 | 20 | 1278.47 (25.56) | 75.50 (11.94)  | 18.02 (10.71)      | 7.67 (2.85)
Results.
As shown in Table 1, FENSHSES is much faster than the term match approach. To better visualize these speedups, we plot average search latencies under logarithmic transformation in Figs. 2 and 3. In the following, we address the contribution of each component of FENSHSES respectively.

- Bit operation. By computing the Hamming distance with bit operations instead of term match, we consistently observe around a sixteen-fold speedup across different values of $m$ and $r$.

- Subcode filtering. The amount of speedup introduced by subcode filtering varies with the radius $r$. Specifically, since $\lfloor r/s \rfloor$ heavily influences the number of data points to be considered for Hamming distance computation (see (5)), the search latencies of FENSHSES w/o prep. are quite similar for values of $r$ sharing the same $\lfloor r/s \rfloor$. As $\lfloor r/s \rfloor$ becomes larger, subcode filtering becomes less effective. In practice, since we most likely care about nearest neighbors within a small radius, subcode filtering should be capable of greatly reducing the search latency.

- Data preprocessing with permutation. By reshuffling binary codes to reduce their correlations within each subcode group, data preprocessing with permutation not only accelerates FENSHSES in terms of average search latency, but also stabilizes its overall performance, with much smaller standard deviations.
5 Conclusion
It has been recently demonstrated that NNS systems built upon full-text search engines are capable of effectively reducing main memory consumption, coherently supporting multi-modal search and being well-prepared for production deployment. Motivated by these clear advantages, in this paper we explore how to empower full-text search engines to efficiently find nearest neighbors in Hamming space. By revisiting bit operation, subcode filtering and data preprocessing with permutation, we propose a novel approach to accomplish this task, which is shown empirically to be substantially faster than the term match approach, the previous state-of-the-art way for full-text search engines to find nearest neighbors within binary codes. By implementing the proposed approach non-trivially on the Elasticsearch platform, we deliver a cutting-edge engineering solution called FENSHSES. In the future, we will also explore how to implement our approach efficiently on other full-text search engines (e.g., Solr and Sphinx).
Acknowledgment
We are grateful to the three anonymous reviewers for their helpful suggestions and comments, which substantially improved the paper. We would also like to thank Aliasgar Kutiyanawala for helping us fix a bug in an earlier version of JSON 3.1, and Eliot P. Brenner for bringing the work [19] to our attention.
References
 [1] Amato, G., Bolettieri, P., Carrara, F., Falchi, F., Gennaro, C.: Large-scale image retrieval with Elasticsearch. In: SIGIR (2018)
 [2] Andreev, K., Räcke, H.: Balanced graph partitioning. Theory of Computing Systems 39(6), 929–939 (2006)
 [3] Bernhardsson, E.: Annoy: Approximate Nearest Neighbors in C++/Python (2018), https://pypi.org/project/annoy/, Python package version 1.13.0
 [4] Bader, D.A., Meyerhenke, H., Sanders, P., Wagner, D.: Graph Partitioning and Graph Clustering, vol. 588. American Mathematical Society (2013)
 [5] Beeler, M., Gosper, R.W., Schroeppel, R.: HAKMEM. MIT Artificial Intelligence Laboratory (1972)
 [6] Chen, Q., Wang, H., Li, M., Ren, G., Li, S., Zhu, J., Li, J., Liu, C., Zhang, L., Wang, J.: SPTAG: A library for fast approximate nearest neighbor search (2018), https://github.com/Microsoft/SPTAG
 [7] Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017)
 [8] Kernighan, B., Lin, S.: An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal 49(2), 291–307 (1970)
 [9] Krauthgamer, R., Naor, J., Schwartz, R.: Partitioning graphs into balanced components. In: SODA (2009)
 [10] Li, G., Deng, D., Wang, J., Feng, J.: Pass-Join: A partition-based method for similarity joins. In: VLDB (2012)
 [11] Lin, K., Yang, H.F., Hsiao, J.H., Chen, C.S.: Deep learning of binary hash codes for fast image retrieval. In: CVPR DeepVision Workshop (2015)
 [12] Lux, M., Marques, O.: Visual Information Retrieval Using Java and LIRE, vol. 25. Morgan & Claypool Publishers (2013)
 [13] Mu, C., Yang, B., Yan, Z.: An empirical comparison of FAISS and FENSHSES for nearest neighbor search in Hamming space. In: SIGIR eCommerce Workshop (2019)
 [14] Mu, C., Zhao, J., Yang, G., Zhang, J., Yan, Z.: Towards practical visual search engine within Elasticsearch. In: SIGIR eCommerce Workshop (2018)
 [15] Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2227–2240 (2014)
 [16] Norouzi, M., Punjani, A., Fleet, D.J.: Fast search in Hamming space with multi-index hashing. In: CVPR (2012)
 [17] Qin, J., Wang, Y., Xiao, C., Wang, W., Lin, X., Ishikawa, Y.: GPH: Similarity search in Hamming space. In: ICDE (2018)
 [18] Ruzicka, M., Novotny, V., Sojka, P., Pomikalek, J., Rehurek, R.: Flexible similarity search of semantic vectors using full-text search engines. http://ceur-ws.org/Vol-1923/article-01.pdf (2018)
 [19] Rygl, J., Pomikalek, J., Rehurek, R., Ruzicka, M., Novotny, V., Sojka, P.: Semantic vector encoding and similarity search using full-text search engines. In: RepL4NLP Workshop (2017)
 [20] Yang, F., Kale, A., Bubnov, Y., Stein, L., Wang, Q., Kiapour, H., Piramuthu, R.: Visual search at eBay. In: KDD (2017)