Reconfigurable Inverted Index


Yusuke Matsui (National Institute of Informatics, matsui@nii.ac.jp), Ryota Hinami (The University of Tokyo, hinami@nii.ac.jp), and Shin'ichi Satoh (National Institute of Informatics, satoh@nii.ac.jp)
Abstract.

Existing approximate nearest neighbor search systems suffer from two fundamental problems that are of practical importance but have not received sufficient attention from the research community. First, although existing systems perform well over the whole database, it is difficult to run a search over a subset of the database. Second, there has been no discussion concerning the performance degradation after many new items are added to a system. We develop a reconfigurable inverted index (Rii) to resolve these two issues. Based on the standard IVFADC system, we design a data layout such that items are stored linearly. This enables us to efficiently run a subset search by switching the search method to a linear PQ scan if the size of the subset is small. Owing to the linear layout, the data structure can be dynamically adjusted after new items are added, maintaining the fast speed of the system. Extensive comparisons show that Rii achieves a performance comparable to state-of-the-art systems such as Faiss.

Approximate nearest neighbor search; inverted index; product quantization; subset search; reconfigure
MM '18, October 22–26, 2018, Seoul, Republic of Korea. © 2018. DOI: 10.1145/3240508.3240630. ISBN: 978-1-4503-5665-7/18/10.
CCS concepts: Information systems — nearest-neighbor search; search engine indexing; multimedia and multimodal retrieval. Computing methodologies — visual content-based indexing and retrieval.

1. Introduction

In recent years, approximate nearest neighbor search (ANN) has received increasing attention from various research communities (Gudmundsson et al., 2018). Typical ANN systems operate in two stages. In the offline phase, database vectors are stored in the ANN system. These vectors may be converted to other forms, such as compact codes, for fast searching and efficient memory usage. In the online querying phase, the system receives a query vector. Items similar to the query are retrieved from the stored database vectors, and their identifiers (and optionally their distances to the query) are returned. To handle large datasets, this search should be not only fast and accurate, but also memory efficient.

(a) Subset search
(b) Performance degradation via data addition
Figure 3. The two problems tackled in this paper. (a) The search is run over a subset of the database, which is specified by the target identifiers. The search result (a ranked list) should contain the specified items only. (b) Given a fast (optimized) ANN system, new vectors are added. Is the updated ANN system still fast?

Although many ANN methods have already been proposed, there are two critical problems of practical importance that have not received sufficient attention from the research community (Fig. 3).

  • Subset search (Fig. 3(a)): Once database vectors are stored, modern ANN systems can run a search efficiently over the whole database. Surprisingly, however, almost no systems can run a search over a subset of the database. For example, the state-of-the-art systems Faiss (Jégou et al., 2018) and Annoy (Bernhardsson, 2018) do not provide this functionality; see the discussions at https://github.com/facebookresearch/faiss/issues/322 and https://github.com/spotify/annoy/issues/263. Consider an image search problem, where the search is formulated as an ANN search over feature vectors, and assume that each image also has a corresponding shooting date. Given a query image, an ANN system can easily find similar images in the whole dataset. However, it is not trivial to find similar images that were taken on a target date (say, May 28, 1987). Here, the search should be conducted not over the whole dataset, but over a subset of it, where the subset is specified by the identifiers of the target images. The straightforward solution is to run the search and then check whether or not each result was taken on May 28, but this post-checking can be drastically slow, especially if the size of the subset is small. Current ANN systems do not provide a clear solution to this problem.

  • Performance degradation via data addition (Fig. 3(b)): So far, the manner in which the search performance degrades when new items are added has not been discussed. The number of database items is typically assumed to be given when an ANN system is built, and the parameters of the system are usually optimized with this number in mind. However, in a practical scenario, new items are often added to the system. Although the performance does not change while the number of new items is small, we can ask whether the system remains efficient even after many items have been added. To put this another way, suppose that one would like to develop a search system that can eventually handle 1,000,000 vectors, but only has 1,000 vectors in the initial stage. In such a case, is the search fast even for 1,000 vectors?

We develop an ANN system that solves the above two problems, namely the reconfigurable inverted index (Rii). The key idea is extremely simple: store the data linearly. Based on the well-known inverted file with product quantization (PQ) approach (IVFADC) (Jégou et al., 2011a), we design the data layout such that an item can be fetched by its identifier with a cost of $O(1)$. This simple but critical modification enables us to search over a subset of the dataset efficiently by switching to a linear PQ scan if the size of the subset is small. Owing to this linear layout, the granularity of the coarse assignment step can easily be controlled by running the clustering again over the dataset whenever the user wishes. This means that the data structure can be adjusted dynamically after new items are added.

An extensive comparison with state-of-the-art systems, such as Faiss (Jégou et al., 2018), Annoy (Bernhardsson, 2018), FALCONN (Razenshteyn and Schmidt, 2018), and NMSLIB (Naidan et al., 2018), shows that Rii achieves a comparable performance. For the subset-search and data-addition problems, for which the existing approaches do not perform well, we demonstrate that Rii remains fast in all cases.

Our contributions are summarized as follows.

  • Rii enables efficient searching over a subset of the whole database, regardless of the size of the subset.

  • Rii remains fast, even after many new items are added, because the data structure is dynamically adjusted for the current number of database items.

2. Related Work

We review existing work that is closely related to our approach.

Locality-sensitive hashing

Locality-sensitive hashing (LSH) (Datar et al., 2004) is one of the most popular branches of ANN. Hash functions are designed such that the probability of collision is higher for close points than for points that are widely separated. Using these functions with hash tables, nearest items can be found efficiently. Although LSH has been said to require substantial memory and to be less accurate than data-dependent methods, a recent well-tuned library (FALCONN (Andoni et al., 2015; Razenshteyn and Schmidt, 2018)) using the multi-probe technique (Lv et al., 2007) has achieved reasonable performance.

Projection/tree-based approach

Space partitioning using a projection or tree constitutes another significant branch of ANN. Especially in the computer vision community, one of the most widely employed methods is FLANN (Muja and Lowe, 2014). Recently, the random projection forest-based method Annoy (Bernhardsson, 2018) achieved a good performance for million-scale data.

Graph traversal

Benchmark scores (Aumüller et al., 2017; Bernhardsson et al., 2018) show that graph traversal-based methods (Malkov et al., 2014; Malkov and Yashunin, 2016) achieve the current best performance (the fastest at a fixed recall) when the number of database items is around one million. These methods first create a graph, called a navigable small world, where each node corresponds to a database item. Given a query, the algorithm starts from a random initial node and traverses the graph toward the node that is closest to the query. In particular, the hierarchical version HNSW (Malkov and Yashunin, 2016), with the highly optimized implementation NMSLIB (Boytsov and Naidan, 2013), represents the current state-of-the-art. The drawback is that it tends to consume memory space and requires a long runtime for building the data structure.

Product quantization

Product quantization (PQ) (Jégou et al., 2011a) and its extensions (Ge et al., 2014; Norouzi and Fleet, 2013; Babenko and Lempitsky, 2014; Martinez et al., 2016; Zhang et al., 2014, 2015; Babenko and Lempitsky, 2015b; Douze et al., 2016; Heo et al., 2014; Jain et al., 2016; Babenko and Lemitsky, 2017; Wang et al., 2015) are popular approaches to handling large-scale data. Our proposed Rii method also follows this line. PQ-based methods compress vectors into short memory-efficient codes. The Euclidean distance between an original vector and compressed code can be efficiently approximated using a lookup table. Current billion-scale search systems are usually based on PQ methods, especially combined with an inverted index-based architecture (Babenko and Lempitsky, 2015a; Kalantidis and Avrithis, 2014; Matsui et al., 2018b; Iwamura et al., 2013; Heo et al., 2016; Spyromitros-Xioufis et al., 2014; Xia et al., 2013). Hardware-based acceleration has also recently been discussed (André et al., 2015, 2017; Blalock and Guttag, 2017; Wieschollek et al., 2016; Johnson et al., 2017; Zhang et al., 2018; Liu et al., 2017). An efficient implementation proposed by the original authors is Faiss (Johnson et al., 2017; Jégou et al., 2018). An extensive survey is given in (Matsui et al., 2018a).

3. Background: Product Quantization

In this section, we review product quantization (PQ) (Jégou et al., 2011a). PQ compresses vectors into memory-efficient short codes. The squared Euclidean distance between an input vector and a compressed code can be approximated efficiently. Owing to its memory-efficient form, PQ has played a central role in large-scale ANN systems.

We first describe how to encode a vector. A $D$-dimensional input vector $\mathbf{x} \in \mathbb{R}^D$ is split into $M$ sub-vectors. Each $D/M$-dimensional sub-vector is compared to $K$ pre-trained code words, and the identifier (an integer in $\{1, \dots, K\}$) of the closest one is recorded. Using this, $\mathbf{x}$ is encoded as $\bar{\mathbf{x}}$, which is a tuple of $M$ integers:

(1)   $\bar{\mathbf{x}} = [\bar{x}_1, \dots, \bar{x}_M]^\top \in \{1, \dots, K\}^M,$

where the $m$-th sub-vector of $\mathbf{x}$ is quantized into $\bar{x}_m$. We refer to $\bar{\mathbf{x}}$ as a PQ-code for $\mathbf{x}$. Note that $\bar{\mathbf{x}}$ is represented by $M \log_2 K$ bits, and we set $K$ to 256 in order to represent each code using $M$ bytes.

Next, we show how to search over the PQ-codes given a query vector $\mathbf{q} \in \mathbb{R}^D$. First, a distance table $A \in \mathbb{R}^{M \times K}$ is computed online by comparing the query to the code words. Here, $A(m, k)$ is the squared Euclidean distance between the $m$-th part of $\mathbf{q}$ and the $k$-th code word of the $m$-th codebook. The squared Euclidean distance between the query and a database vector $\mathbf{x}$ can then be approximated using the PQ-code $\bar{\mathbf{x}}$, as follows:

(2)   $d(\mathbf{q}, \mathbf{x})^2 \approx d_A(\mathbf{q}, \bar{\mathbf{x}})^2 = \sum_{m=1}^{M} A(m, \bar{x}_m).$

This is called an asymmetric distance computation (ADC) (Jégou et al., 2011a), and can be performed efficiently, because only $M$ fetches on $A$ are required. A search over $N$ PQ-codes requires $O(DK + MN)$.
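
To make the encoding and ADC steps concrete, the following is a minimal NumPy sketch (our illustration, not the paper's implementation; `codewords`, `pq_encode`, and `adc_distances` are hypothetical names, and codebook training is omitted):

import numpy as np

def pq_encode(x, codewords):
    # Encode one D-dim vector into M sub-codes (Eq. (1)).
    # codewords: (M, K, D//M) array of pre-trained sub-codebooks.
    M, K, Ds = codewords.shape
    code = np.empty(M, dtype=np.uint8)  # K = 256, so one byte per sub-code
    for m in range(M):
        sub = x[m * Ds:(m + 1) * Ds]
        code[m] = np.argmin(((codewords[m] - sub) ** 2).sum(axis=1))
    return code

def adc_distances(q, codes, codewords):
    # Asymmetric distance computation (Eq. (2)): build the M x K table A
    # once per query, then approximate d(q, x_n)^2 by M table lookups.
    M, K, Ds = codewords.shape
    A = np.stack([((codewords[m] - q[m * Ds:(m + 1) * Ds]) ** 2).sum(axis=1)
                  for m in range(M)])          # A: (M, K)
    return A[np.arange(M), codes].sum(axis=1)  # codes: (N, M) -> (N,)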

4. Reconfigurable Inverted Index

Now, we introduce our proposed approach: the reconfigurable inverted index (Rii). Let us define a query vector $\mathbf{q} \in \mathbb{R}^D$, database vectors $\mathcal{X} = \{\mathbf{x}_n\}_{n=1}^N$, and target identifiers $\mathcal{S} \subseteq \{1, \dots, N\}$. The subset-search problem is to find the $R$ items most similar to the query among the subset of $\mathcal{X}$ specified by $\mathcal{S}$:

(3)   $\mathop{\mathrm{argmin}^R}_{n \in \mathcal{S}} \; d(\mathbf{q}, \mathbf{x}_n)^2,$

where the operator $\mathrm{argmin}^R$ finds the $R$ arguments for which the objective function attains its (sorted) smallest values. The exact solution can be obtained by a time-consuming direct linear scan. Our goal is to approximately find the $R$ nearest items in a fast and memory-efficient manner. Note that the problem reduces to a usual ANN search if the whole database is set as the subset: $\mathcal{S} = \{1, \dots, N\}$.

4.1. Data Structure

First, the input database vectors are encoded as PQ-codes $\bar{\mathcal{X}} = \{\bar{\mathbf{x}}_n\}_{n=1}^N$, where each $\bar{\mathbf{x}}_n \in \{1, \dots, 256\}^M$. These PQ-codes are stored linearly, meaning that they are stored in a single long array. Given an identifier $n$, fetching $\bar{\mathbf{x}}_n$ requires a computational cost of $O(1)$.

The PQ-codes are clustered into $L$ groups for inverted indexing. First, coarse centers $\{\bar{\mathbf{c}}_l\}_{l=1}^L$ are created by running a clustering algorithm (PQk-means (Matsui et al., 2017)) on $\bar{\mathcal{X}}$ (or a subset of it). Note that each coarse center is also a PQ-code, $\bar{\mathbf{c}}_l \in \{1, \dots, 256\}^M$. Using these coarse centers, the database PQ-codes are clustered into $L$ groups. The resulting assignments are stored as posting lists $\mathcal{W}_1, \dots, \mathcal{W}_L$, where each $\mathcal{W}_l$ is the set of identifiers of the database vectors whose nearest coarse center is the $l$-th one:

(4)   $\mathcal{W}_l = \{ n \in \{1, \dots, N\} \mid a(\bar{\mathbf{x}}_n) = l \}.$

Note that $a(\cdot)$ is the assignment function, defined as $a(\bar{\mathbf{x}}) = \mathrm{argmin}_{l \in \{1, \dots, L\}} \, d_S(\bar{\mathbf{x}}, \bar{\mathbf{c}}_l)^2$, where $d_S$ is a symmetric distance function that measures the distance between two PQ-codes (Jégou et al., 2011a; Matsui et al., 2017). Finally, we store $\bar{\mathcal{X}}$, $\{\bar{\mathbf{c}}_l\}_{l=1}^L$, and $\{\mathcal{W}_l\}_{l=1}^L$ as the data structure for Rii. The total theoretical memory usage is $8M(N + L) + 32N$ bits if an integer is represented by 32 bits ($M$ bytes for each of the $N + L$ PQ-codes, plus $N$ stored identifiers). We will show in Sec. 5.5 that this theoretical value is almost the same as the measured value.

Note that in a typical implementation of the original IVFADC (Jégou et al., 2011a) system, the PQ-codes are stored in the posting lists (not in a single array). That is, the codes $\{\bar{\mathbf{x}}_n \mid n \in \mathcal{W}_l\}$ are chunked for each $l$ and then stored. This enhances the locality of the data and improves cache efficiency when traversing a posting list. However, the experimental results (Sec. 5.5) show that this difference is not serious.
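
As a sketch of this layout (our own illustration, not the library's actual C++ internals), the whole index reduces to three plain arrays:

import numpy as np

class RiiLayout:
    """Illustrative data layout: PQ-codes in one contiguous array,
    coarse centers stored as PQ-codes, and posting lists of identifiers."""
    def __init__(self, M, L):
        self.codes = np.empty((0, M), dtype=np.uint8)    # X-bar: row n is x-bar_n
        self.centers = np.empty((L, M), dtype=np.uint8)  # c-bar_1, ..., c-bar_L
        self.posting_lists = [[] for _ in range(L)]      # W_l: identifiers only

    def fetch(self, n):
        # Linear storage: any PQ-code is addressable by its identifier in O(1).
        return self.codes[n]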

4.2. Search

We explain how to search for similar vectors using the data structure explained above. Our system provides two search methods: PQ-linear-scan and inverted-index. The former is fast when the size of a target subset is small, and the latter is fast when the size is large. Depending on the size, the faster method is automatically selected.

A search over a subset of the database is defined as a search over the target PQ-codes denoted by the target identifiers $\mathcal{S} \subseteq \{1, \dots, N\}$. Note that we assume the elements of $\mathcal{S}$ are sorted. (A set is denoted by calligraphic font, such as $\mathcal{S}$, and implemented by a single array.) This is a slightly strong but reasonable assumption. Because $\mathcal{S}$ is sorted, whether or not an item is contained in the set ($n \in \mathcal{S}$) can be checked with a cost of $O(\log |\mathcal{S}|)$ using a binary search, where $|\mathcal{S}|$ is the number of elements in $\mathcal{S}$. Note again that a search over the whole dataset is available by setting $\mathcal{S} = \{1, \dots, N\}$.
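
In code, this membership test might look as follows (a minimal sketch using Python's standard bisect module; the function name is ours):

import bisect

def in_subset(S, n):
    # O(log |S|) membership test on the sorted identifier array S.
    pos = bisect.bisect_left(S, n)
    return pos < len(S) and S[pos] == n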

PQ-linear-scan

: Because the database PQ-codes are stored linearly, we can simply pick up the target PQ-codes and evaluate their distances to the query. We call this the PQ-linear-scan. It is fast if $|\mathcal{S}|$ is small, because only a fraction of the vectors are compared. The pseudocode is presented in Alg. 1.

As inputs, the system accepts a query vector $\mathbf{q}$, the database PQ-codes $\bar{\mathcal{X}}$, the number of returned items $R$, and the target identifiers $\mathcal{S}$. First, a distance table $A$ is created by comparing the query to the code words (L1); for simplicity, the code words themselves are omitted from the pseudocode. This is an online pre-processing step required by all PQ-based methods. To store the results, an array of tuples is prepared (L2). Each tuple consists of (1) the identifier of an item and (2) the distance between the query and the item. For each target identifier $s \in \mathcal{S}$, the asymmetric distance to the query is computed (L4). This distance is then stored in the result array with its identifier $s$, where the PushBack function appends an element to an array (L5). After all target items have been evaluated, the result array is sorted by distance (L6). As we require only the top $R$ results, we use a partial sort algorithm. Finally, the top $R$ elements are returned, where the Take function simply picks up the first several elements (L7). Note that $\{\bar{\mathbf{c}}_l\}_{l=1}^L$ and $\{\mathcal{W}_l\}_{l=1}^L$ are not required for this search.

Let us analyze the computational cost. The creation of the distance table requires $O(DK)$, and the comparison to $|\mathcal{S}|$ items requires $O(M|\mathcal{S}|)$. Partial sorting requires $O(|\mathcal{S}| \log R)$ on average; this cost comes from the heap-based implementation of the partial_sort function in the C++ STL. (Another option is to pick up the smallest $R$ items and sort only these, leading to $O(|\mathcal{S}| + R \log R)$; we used the former because we empirically found it faster in practice, especially when $R$ is small.) Their sum leads to a final average cost of $O(DK + |\mathcal{S}|(M + \log R))$ (Table 1). Clearly, the computation is efficient if $|\mathcal{S}|$ is small. Because the cost depends linearly on $|\mathcal{S}|$, the PQ-linear-scan becomes inefficient if $|\mathcal{S}|$ is large. Note that if the search target is the whole dataset, $|\mathcal{S}|$ is replaced by $N$.

Input: $\mathbf{q} \in \mathbb{R}^D$,              # Query
$\bar{\mathcal{X}} = \{\bar{\mathbf{x}}_n\}_{n=1}^N$,     # Database PQ-codes
$R \in \mathbb{N}$,  # Number of returned items
$\mathcal{S}$  # Target subset identifiers
Output: $\{(n_r, d_r)\}_{r=1}^R$ s.t. $d_1 \le \dots \le d_R$
# $n_r$: r-th identifier. $d_r$: r-th distance.
1 $A \leftarrow$ DTable($\mathbf{q}$)  # Distance table
2 $e \leftarrow [\,]$  # Array of tuples (id, distance)
3 for $s \in \mathcal{S}$ do
4       $d \leftarrow d_A(\mathbf{q}, \bar{\mathbf{x}}_s)^2$
5       PushBack($e$, $(s, d)$)
6 PartialSort($e$, $R$)  # Sort by distance
return Take($e$, $R$)  # Top R
Algorithm 1 PQLinearScan
Operation | Computational complexity
PQLinearScan |
  - whole data | $O(DK + N(M + \log R))$
  - subset | $O(DK + |\mathcal{S}|(M + \log R))$
InvertedIndex |
  - whole data | $O(DK + ML + L \log \frac{TL}{N} + T(M + \log R))$
  - subset | $O(DK + ML + L \log \frac{TL}{|\mathcal{S}|} + \frac{TN}{|\mathcal{S}|} \log |\mathcal{S}| + T(M + \log R))$
Table 1. The average computational complexity for each operation. The variables are: $D$ (vector dimensionality), $K$ (number of code words per sub-space), $M$ (number of sub-spaces), $N$ (database size), $L$ (number of coarse centers), $|\mathcal{S}|$ (target subset size), $T$ (number of candidates), and $R$ (number of returned items).
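
For concreteness, Alg. 1 can be sketched in a few lines of NumPy (reusing the distance table $A$ from the Sec. 3 sketch; the helper names are ours):

import numpy as np

def pq_linear_scan(A, codes, S, R):
    # A: (M, K) distance table for the query; codes: (N, M) PQ-codes;
    # S: sorted array of target identifiers; R: number of results.
    M = A.shape[0]
    dists = A[np.arange(M), codes[S]].sum(axis=1)        # ADC for the targets only
    top = np.argpartition(dists, min(R, len(S)) - 1)[:R]  # partial selection
    order = top[np.argsort(dists[top])]                  # sort only the top R
    return np.asarray(S)[order], dists[order]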

Inverted-index

: The other search method is the inverted-index. Because the database items are clustered in advance, as explained in Sec. 4.1, we can evaluate only the items in the clusters nearest to the query. This drastically boosts performance when the number of target identifiers $|\mathcal{S}|$ is large.

We show the pseudocode in Alg. 2. The inverted-index takes three additional inputs: the posting lists $\{\mathcal{W}_l\}_{l=1}^L$, the coarse centers $\{\bar{\mathbf{c}}_l\}_{l=1}^L$, and the number of candidates $T$. Note that $T$ candidates will be selected and evaluated in the final step. This means that $T$ is a runtime parameter that controls the trade-off between accuracy and runtime.

To search, a distance table is first created in the same manner as for the PQ-linear-scan (L1). The search consists of two blocks: first, the clusters closest to the query are found (L2-6); then, the items inside these clusters are evaluated (L7-16).

To find the closest clusters, an array of tuples is created (L2). For each coarse center $\bar{\mathbf{c}}_l$ ($l \in \{1, \dots, L\}$), the distance from the query is computed (L4). The results are stored in the array (L5).

Next, we run a partial sort on the array to find the clusters closest to the query (L6). Here, the target number of the partial sort (the number of posting lists to be focused on), denoted $w$, is set as $w = \min(L, \lceil TL/|\mathcal{S}| \rceil)$, which is determined as follows. Because the target identifiers are of size $|\mathcal{S}|$, where the total number of identifiers is $N$, the probability of any item being a target identifier is $|\mathcal{S}|/N$ on average. Because our purpose here is to select $T$ target items as candidates for the search, the required number of items to traverse is $TN/|\mathcal{S}|$. To traverse $TN/|\mathcal{S}|$ items, we need to focus on $\lceil TL/|\mathcal{S}| \rceil$ posting lists, because the average number of items per posting list is $N/L$. This implies that we need to select the $\lceil TL/|\mathcal{S}| \rceil$ nearest posting lists. Note that if $L < \lceil TL/|\mathcal{S}| \rceil$, we simply replace the value by $L$, which amounts to a full sort of the array ($O(L \log L)$).

The selected posting lists are then evaluated. A score array is prepared (L7). For each of the $w$ closest posting lists (L8), the identifiers in the posting list are traversed (L9). If an identifier is not included in the target identifiers $\mathcal{S}$, the item is simply ignored (L10-11). Note that if the search is over the whole dataset ($\mathcal{S} = \{1, \dots, N\}$), any item is always included in $\mathcal{S}$, and thus L10-11 can be removed.

For a selected identifier $n$, the identifier and the distance to the query are recorded in the same manner as for the PQ-linear-scan (L12-13). If the size of the score array reaches the parameter $T$, the top $R$ results are selected and returned (L14-16).

The computational cost is summarized as follows. After the table creation with $O(DK)$, the comparison to the $L$ coarse centers requires $O(ML)$. The partial sort requires $O(L \log w)$. The number of items to be traversed is $TN/|\mathcal{S}|$. We can check whether or not each item is included in $\mathcal{S}$ using a binary search, requiring $O(\log |\mathcal{S}|)$; this leads to $O((TN/|\mathcal{S}|) \log |\mathcal{S}|)$ in total. The number of items that are actually evaluated is $T$, incurring a cost of $O(MT)$. Finally, the top $R$ items are selected using the partial sort, requiring $O(T \log R)$. Table 1 summarizes the total cost. The inverted-index is fast when $|\mathcal{S}|$ is sufficiently large, but slow if $|\mathcal{S}|$ is small. This is highlighted by the term $(TN/|\mathcal{S}|) \log |\mathcal{S}|$, which becomes dominant when $|\mathcal{S}|$ is small.

Note that although there appear to be several input parameters for the inverted-index, all of them except $T$ are usually decided deterministically; $T$ is the only parameter the user needs to set. Our initial setting is the average length of a posting list, $T = N/L$, which means that the system traverses one posting list on average. This is a fast setting, and users can increase $T$ if they require more accuracy.

Input: $\mathbf{q} \in \mathbb{R}^D$,              # Query
$\bar{\mathcal{X}} = \{\bar{\mathbf{x}}_n\}_{n=1}^N$,     # Database PQ-codes
$\{\mathcal{W}_l\}_{l=1}^L$, # Posting lists
$\{\bar{\mathbf{c}}_l\}_{l=1}^L$,      # Coarse centers
$R \in \mathbb{N}$,   # Number of returned items
$\mathcal{S}$,  # Target subset identifiers
$T \in \mathbb{N}$    # Number of candidates
Output: $\{(n_r, d_r)\}_{r=1}^R$ s.t. $d_1 \le \dots \le d_R$
# $n_r$: r-th identifier. $d_r$: r-th distance.
1 $A \leftarrow$ DTable($\mathbf{q}$)  # Distance table
2 $e \leftarrow [\,]$  # Array of tuples (id, distance)
3 for $l \in \{1, \dots, L\}$ do
4       $d \leftarrow d_A(\mathbf{q}, \bar{\mathbf{c}}_l)^2$
5       PushBack($e$, $(l, d)$)
6 PartialSort($e$, $w$)  # Sort by distance; $w = \min(L, \lceil TL/|\mathcal{S}| \rceil)$
7 $f \leftarrow [\,]$  # Array of tuples (id, distance)
8 for $(l, d) \in$ Take($e$, $w$) do
9       for $n \in \mathcal{W}_l$ do
10             if $n \notin \mathcal{S}$ then
11                   continue
12            $d \leftarrow d_A(\mathbf{q}, \bar{\mathbf{x}}_n)^2$
13            PushBack($f$, $(n, d)$)
14             if $|f| = T$ then
15                    PartialSort($f$, $R$)  # Sort by distance
16                   return Take($f$, $R$)  # Top R
Algorithm 2 InvertedIndex
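
A compact Python sketch of Alg. 2 follows (our illustration; for brevity it uses a full argsort over the $L$ center distances where the paper uses a partial sort, and assumes `A`, `codes`, and `centers` are NumPy arrays as in the earlier sketches):

import bisect
import numpy as np

def inverted_index(A, codes, posting_lists, centers, S, R, T):
    # A: (M, K) distance table; codes: (N, M) PQ-codes stored linearly;
    # centers: (L, M) coarse PQ-codes; S: sorted target ids; T: # candidates.
    M, L = A.shape[0], len(centers)
    w = min(L, -(-(T * L) // len(S)))            # ceil(TL/|S|) posting lists
    center_dists = A[np.arange(M), centers].sum(axis=1)
    ids, dists = [], []
    for l in np.argsort(center_dists)[:w]:       # w nearest posting lists
        for n in posting_lists[l]:
            pos = bisect.bisect_left(S, n)       # O(log |S|) subset check
            if pos == len(S) or S[pos] != n:
                continue
            ids.append(n)
            dists.append(A[np.arange(M), codes[n]].sum())
            if len(ids) == T:                    # T candidates collected
                break
        else:
            continue
        break
    order = np.argsort(np.asarray(dists))[:R]    # select the top R
    return np.asarray(ids)[order], np.asarray(dists)[order]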

Selection

: The final query algorithm is described in Alg. 3. Given the inputs, the system automatically selects the query method as either the PQ-linear-scan or the inverted-index. This decision is based on a threshold value $\theta$ for the number of target identifiers $|\mathcal{S}|$ (L1). Owing to this flexible switching, we can always achieve a fast search with a single Rii data structure ($\bar{\mathcal{X}}$, $\{\bar{\mathbf{c}}_l\}_{l=1}^L$, and $\{\mathcal{W}_l\}_{l=1}^L$), regardless of the size of the target identifiers. Fig. 4 highlights the relations among the three query algorithms.

Note that it is not trivial to set the threshold $\theta$ deterministically, because it depends on several parameters, such as $N$ and $L$. To find the best threshold, we simply run searches with several parameter combinations when the data structure is constructed. Based on the results, we fit a 1D line in the parameter space and obtain the best threshold. See the supplementary material for more details. This works almost perfectly, as shown in Fig. 4. The thresholding does not add any runtime cost to the search phase.

Input: $\mathbf{q}$, $\bar{\mathcal{X}}$, $\{\mathcal{W}_l\}_{l=1}^L$, $\{\bar{\mathbf{c}}_l\}_{l=1}^L$, $R$, $\mathcal{S}$, $T$
# See the definitions in Alg. 2
Output: $\{(n_r, d_r)\}_{r=1}^R$ s.t. $d_1 \le \dots \le d_R$
1 if $|\mathcal{S}| < \theta$ then
2       return PQLinearScan($\mathbf{q}$, $\bar{\mathcal{X}}$, $R$, $\mathcal{S}$)  # Alg. 1
3 else
4       return InvertedIndex($\mathbf{q}$, $\bar{\mathcal{X}}$, $\{\mathcal{W}_l\}$, $\{\bar{\mathbf{c}}_l\}$, $R$, $\mathcal{S}$, $T$)  # Alg. 2
Algorithm 3 Query
Figure 4. Comparison of the PQ-linear-scan, the inverted-index, and the final query algorithm. The runtime per query on the SIFT1M dataset is plotted for various sizes of the target identifiers $|\mathcal{S}|$. Note that $N = 10^6$.

4.3. Reconfiguration

Here, we introduce the reconfigure function, which enables efficient searching even after a large number of vectors have been newly added. As discussed in Sec. 1, typical ANN systems are first optimized to achieve a fast search for $N$ items. If new items are added later, such systems might become slow. For example, IVFADC requires an initial decision on the number of space partitions $L$. The selection of $L$ is sensitive and critical to the performance. A standard convention (https://github.com/facebookresearch/faiss/wiki/Index-IO,-index-factory,-cloning-and-hyper-parameter-tuning) is to set $L \sim \sqrt{N}$. On the other hand, $L$ cannot be changed later. The system can become slower if $N$ changes significantly. In other words, we must decide $L$ even if the final database size is not known, which sometimes frustrates users.

Unlike these existing methods, Rii provides a reconfigure function. If the search becomes slow because of newly added items, coarse centers and assignments are updated by simply running clustering again. The system is automatically optimized to achieve the fastest search for the current number of database items.

Data addition

Let us first explain how to add a new item. Given a new PQ-code $\bar{\mathbf{x}}_{N+1}$, the database PQ-codes are updated by PushBack($\bar{\mathcal{X}}$, $\bar{\mathbf{x}}_{N+1}$). The corresponding posting list is also updated by PushBack($\mathcal{W}_{a(\bar{\mathbf{x}}_{N+1})}$, $N+1$). Searching can then be performed without any modification, but it may become slower after many items are added. This is because each posting list (of average length $N/L$) can become too long, making the traversal inefficient.

Reconfigure

If the search becomes slow, the reconfigure function can be called (Alg. 4). This function takes the database PQ-codes $\bar{\mathcal{X}}$ and a new number of coarse space partitions $L'$ as inputs. Again, $L'$ is typically set as $\sqrt{N}$ for the new $N$. The outputs are the updated posting lists and coarse centers. First, the updated coarse centers are computed by running clustering over the PQ-codes using PQk-means (Matsui et al., 2017) (L1). PQk-means efficiently partitions the input PQ-codes into clusters without decoding them back to the original $D$-dimensional vectors. Note that the clustering can be run on a subset of $\bar{\mathcal{X}}$ to make this step fast; we set an upper limit on the number of codes to be clustered. After the new coarse centers are obtained, the posting lists are created by simply finding the nearest center for each PQ-code (L2-4).

The advantage of the reconfigure function is that it can be called whenever the user wishes. The results are deterministic for a given $\bar{\mathcal{X}}$ and $L'$, because the function simply runs clustering over the codes. We will show in Sec. 5.4 that the reconfigure function is especially useful when the database size changes drastically. Another way of looking at this is that we do not need to know the final number of database items when the index structure is built. This is a clear advantage over IVFADC-based methods. In a practical scenario, it often happens that the number of database items cannot be decided when the system is created. Even in such cases, IVFADC must fix its parameters, which can lead to a suboptimal performance.

Input: $\bar{\mathcal{X}}$,         # Database PQ-codes
$L' \in \mathbb{N}$     # Number of coarse centers
Output: $\{\mathcal{W}_l\}_{l=1}^{L'}$,  # Updated posting lists
$\{\bar{\mathbf{c}}_l\}_{l=1}^{L'}$        # Updated coarse centers
1 $\{\bar{\mathbf{c}}_l\}_{l=1}^{L'} \leftarrow$ PQkmeans($\bar{\mathcal{X}}$, $L'$) # Clustering on PQ-codes (Matsui et al., 2017)
2 $\mathcal{W}_l \leftarrow \emptyset$ for all $l \in \{1, \dots, L'\}$
3 for $n \in \{1, \dots, N\}$ do
4       PushBack($\mathcal{W}_{a(\bar{\mathbf{x}}_n)}$, $n$)
return $\{\mathcal{W}_l\}_{l=1}^{L'}$, $\{\bar{\mathbf{c}}_l\}_{l=1}^{L'}$
Algorithm 4 Reconfigure
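
From the user's perspective, the add-then-reconfigure workflow looks roughly as follows. This sketch follows the public Python interface of the rii package (https://github.com/matsui528/rii) together with nanopq for codebook training; the exact signatures may differ between versions, so treat this as illustrative:

import nanopq
import numpy as np
import rii

Xt = np.random.rand(1000, 128).astype(np.float32)   # training vectors
X = np.random.rand(10000, 128).astype(np.float32)   # initial database vectors

codec = nanopq.PQ(M=32).fit(vecs=Xt)                # train the PQ code words
e = rii.Rii(fine_quantizer=codec)
e.add_configure(vecs=X)                             # encode, then cluster

# Later: many new vectors arrive, then the index is re-clustered
# for the new database size N (e.g., L' = sqrt(N)).
X_new = np.random.rand(50000, 128).astype(np.float32)
e.add(vecs=X_new)
e.reconfigure(nlist=int(np.sqrt(len(X) + len(X_new))))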

4.4. Connection to IVFADC

The data structure proposed above is similar to the original IVFADC (Jégou et al., 2011a), but has the following fundamental differences.

  • In Rii, each vector is encoded directly, whereas IVFADC encodes the residual between an input vector and its coarse center. This makes the accuracy of Rii slightly inferior to that of IVFADC (see Sec. 5.5), but enables us to store the PQ-codes linearly.

  • In Rii, PQ-codes are stored linearly, and their identifiers are stored in posting lists. In IVFADC, both PQ-codes and identifiers are stored in posting lists. This simple modification enables us to run the PQ-linear scan without any additional operations.

  • In IVFADC, the coarse centers are a set of $D$-dimensional vectors, whereas the coarse centers in Rii are PQ-codes. The advantage is that the reconfigure step becomes considerably faster with PQk-means. The limitation is that this might decrease the accuracy, but the experimental results show that the degradation is not serious (Sec. 5.5).

4.5. Advanced Encoding

There exist advanced encoding methods for PQ, such as optimized product quantization (OPQ) (Ge et al., 2014; Norouzi and Fleet, 2013), additive quantization (AQ) (Babenko and Lempitsky, 2014; Martinez et al., 2016), and composite quantization (CQ) (Zhang et al., 2014, 2015). Although state-of-the-art accuracy has been achieved by AQ and CQ, they are widely known to be more complex and time-consuming. Therefore, we did not incorporate AQ and CQ into our system.

On the other hand, OPQ provides a reasonable trade-off (slightly slower, but more accurate). In OPQ, a rotation matrix is trained in advance to minimize the quantization error. In the search phase, an input vector is first rotated with this matrix; the remaining process is exactly the same as for PQ. We show the results of OPQ in Sec. 5.5.
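
As a sketch of the OPQ change on the search side (our illustration; it assumes a rotation matrix Rmat has been trained offline, and reuses pq_encode from the Sec. 3 sketch):

import numpy as np

def opq_encode(x, Rmat, codewords):
    # OPQ = a learned rotation followed by plain PQ; the training of Rmat
    # (to minimize quantization error) is omitted here.
    return pq_encode(Rmat @ x, codewords)

def opq_query_preprocess(q, Rmat):
    # At query time, only this extra rotation is needed; the rest of the
    # pipeline (distance table, ADC, inverted index) is unchanged.
    return Rmat @ q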

5. Evaluations

All experiments were performed on a server with a 3.6 GHz Intel Xeon CPU (six cores, 12 threads) and 128 GB of RAM. For a fair comparison, we employed a single-thread implementation for the search. Rii is implemented in C++ with a Python interface. All source code is publicly available at https://github.com/matsui528/rii.

5.1. Datasets

The various methods were evaluated using the following datasets:

  • SIFT1M (Jégou et al., 2011b) consists of 128D SIFT feature vectors extracted from several images. It provides 1,000,000 base, 10,000 query, and 100,000 training vectors.

  • GIST1M (Jégou et al., 2011b) consists of 960D GIST feature vectors extracted from several images. It provides 1,000,000 base, 1,000 query, and 500,000 training vectors.

  • Deep1B (Babenko and Lempitsky, 2016) consists of 96D deep features extracted from the last FC layer of GoogLeNet (Szegedy et al., 2015) for one billion images. It provides 1,000,000,000 base, 10,000 query, and 1,000,000 training vectors (we used the top 1M from the whole training set).

The code words of Rii and Faiss were preliminarily trained using the training data. The search is conducted over the base vectors.

5.2. Methods

We compare our Rii method with the following existing methods:

  • Annoy (Bernhardsson, 2018): A random projection forest-based system. Because Annoy is easy to use (fewer parameters, intuitive interface, no training steps, and easy IO with a direct mmap design), it is the baseline for million-scale data.

  • FALCONN (Razenshteyn and Schmidt, 2018): Highly optimized LSH (Andoni et al., 2015). FALCONN is regarded as a representative state-of-the-art LSH-based method.

  • NMSLIB (Naidan et al., 2018): Highly optimized ANN library with the support of non-metric spaces (Boytsov and Naidan, 2013). This library includes several algorithms, and we used Hierarchical Navigable Small World (HNSW) (Malkov et al., 2014; Malkov and Yashunin, 2016) in this study. NMSLIB with HNSW is the current state-of-the-art for million-scale data (Aumüller et al., 2017; Bernhardsson et al., 2018).

  • Faiss (Jégou et al., 2018): A collection of highly-optimized PQ-based methods. This library includes IVFADC (Jégou et al., 2011a), OPQ (Ge et al., 2014), inverted multi-index (Babenko and Lempitsky, 2015a), and polysemous codes (Douze et al., 2016). Some of these are implemented using the GPU as well (Johnson et al., 2017). In particular, we compared Rii with the basic IVFADC, which is one of the fastest options. Note that only Faiss and Rii can handle billion-scale data, because PQ-based methods are memory efficient.

5.3. Subset Search

We first present the results for searching over a subset of the whole database. This is the main function that the proposed Rii method provides. The conclusion is that Rii always remains fast, whereas existing methods become considerably slower, especially if the size of the target subset is small. We first explain the task, and then introduce a post-checking module through which existing methods can conduct a subset search. Finally, we present the results.

Task

The task is defined as follows. We randomly select $|\mathcal{S}|$ integers from $\{1, \dots, N\}$, sort them, and use them as the target identifiers $\mathcal{S}$. For each query, we run the search and find the top-$R$ results. All the results must be members of $\mathcal{S}$. The runtime per query is reported for several combinations of $|\mathcal{S}|$ and $R$. The evaluation was conducted on the SIFT1M dataset ($N = 10^6$).

Post-checking module

Because none of the existing methods provide a subset-search functionality, we implemented a straightforward post-checking module that enables them to perform a subset search. Alg. 5 shows the pseudocode. This module takes a query function $g$, a query vector $\mathbf{q}$, target identifiers $\mathcal{S}$, and the number of returned items $R$ as inputs. The query function $g$ returns the identifiers of the $R'$ closest items, given $\mathbf{q}$ and $R'$; this is an existing method such as Annoy. First, the output identifier array $\mathcal{U}$ is prepared (L1). The number of returned items for each iteration, $R'$, is initialized to $R$ (L2). Then, the search begins with an infinite loop. The top-$R'$ items are retrieved using $g$, and the results are stored in the temporary buffer $\mathcal{V}$ (L4). For each identifier $v$ in $\mathcal{V}$, if $v$ has already been checked, the loop continues (L6-7); in practice, this is achieved by starting the for loop with an offset over $\mathcal{V}$, so that the already-checked leading elements are not traversed again. If $v$ is included in $\mathcal{S}$, we store it in the output array (L8-9). The algorithm finishes when enough ($R$) items have been found (L10-11). If an insufficient number of items are found, $R'$ is enlarged by multiplying it by a constant value (L12). The search continues with the updated $R'$ until $R$ items are found.

With this module, searching over a target subset becomes available for the existing methods. Note that $g$ cannot always return $R'$ items when $R'$ is large. This depends on the design of the query function, and some methods limit $R'$ in order not to make the search too slow. We found that FALCONN and NMSLIB do not return $R'$ items when $R'$ is large. Therefore, we compared Rii with Annoy using the post-checking module (Annoy + PC).

Input: $g$,                         # Query function
$\mathbf{q}$,                 # Query vector
$\mathcal{S}$,    # Target subset identifiers
$R$     # Number of returned items
Output: $\mathcal{U}$  # $\mathcal{U}$ is sorted
1 $\mathcal{U} \leftarrow [\,]$  # An array of integers
2 $R' \leftarrow R$
3 while 1 do
4        $\mathcal{V} \leftarrow g(\mathbf{q}, R')$  # Return top-$R'$ results
5       for $v \in \mathcal{V}$ do
6             if $v$ has already been checked then
7                   continue
8            if $v \in \mathcal{S}$ then
9                  PushBack($\mathcal{U}$, $v$)
10            if $|\mathcal{U}| = R$ then
11                   return $\mathcal{U}$
12        $R' \leftarrow cR'$  # $c$: user-defined constant value
Algorithm 5 Post-checking module for existing methods.
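
An equivalent Python sketch of this module (query_fn, the growth constant c, and all other names are ours):

import bisect

def post_checking_search(query_fn, q, S, R, c=2):
    # query_fn(q, r): an existing ANN system returning the ids of the
    # r nearest items; S: sorted target identifiers; c: growth constant.
    out, r, checked = [], R, 0
    while True:
        cand = query_fn(q, r)                # top-r results (L4)
        for n in cand[checked:]:             # skip the already-checked prefix (L6-7)
            p = bisect.bisect_left(S, n)
            if p < len(S) and S[p] == n:     # keep only members of S (L8-9)
                out.append(n)
                if len(out) == R:
                    return out               # enough items found (L10-11)
        checked = len(cand)
        r *= c                               # enlarge the request and retry (L12)
        # (A real implementation would also cap r at N to guarantee termination.)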

Results

Fig. 5 illustrates the results. We point out the following:

  • Rii was fast under all conditions (less than 2 ms/query). We can conclude that Rii is stable and effective for the subset search.

  • As with IVFADC, Rii is robust to the choice of $R$.

  • Annoy + PC became drastically slow for small $|\mathcal{S}|$, and this is further highlighted when $R$ is large. This is an expected result, because the while loop (L3 in Alg. 5) must be repeated several times for large $R'$; here, $R'$ can even reach $N$. ANN systems are usually not designed to handle such values.

Figure 5. Subset search on the SIFT1M dataset over 10 queries. Note that $N = 10^6$.

5.4. Robustness Against Data Addition

We now describe the experiments for our other main function, reconfigure. The conclusion is that Rii stays fast by using reconfigure, even after many new vectors are added. First the task is explained, and then the results are presented. Here, we used the Deep1B dataset to demonstrate robustness at the billion scale.

Task

The index is first constructed using $N$ vectors with $L = \sqrt{N}$, and the runtime is evaluated. Next, new items are added to the index, so that the database grows to a new size $N'$. The runtime is then evaluated in two ways: (1) the search is run with the data structure left as-is, and (2) the data structure is updated using the reconfigure function with $L' = \sqrt{N'}$, and then the search is run. We repeated this experiment for several final sizes $N'$, up to $10^9$.

Results

Fig. 6 illustrates the results. It is clear that the search becomes dramatically faster after the reconfigure function is called. For example, if the user keeps the original data structure after the new items are added, a search takes 3.9 ms on average; it becomes faster after the reconfigure function is applied.

Most importantly, because the data structure can always be adjusted to the new $N$, the user does not face the burden of selecting $L$ when the system is first constructed. This is a clear advantage over the existing methods. Note that the runtime of adding the vectors was 109 s, and that of the reconfigure function was 111 s; these times can be considered moderate.

Figure 6. The runtime performance with and without the reconfigure function on the Deep1B dataset.

5.5. Comparison with Existing Methods

Finally, we compare Rii (and its variant Rii-OPQ) with Annoy, FALCONN, NMSLIB (HNSW), and Faiss (IVFADC), using SIFT1M and GIST1M. The conclusion is that our Rii method achieved a comparable performance to the state-of-the-art system Faiss. Note that the searches were conducted over the whole datasets.

The accuracy was measured using Recall@1, i.e., the fraction of queries for which the ground-truth nearest neighbor is returned as the top-1 result. The average Recall@1 over the query set is reported. We evaluated the methods with several parameter combinations, and report the results at a fixed Recall@1 (0.65 for SIFT1M and 0.5 for GIST1M) for a fair comparison. Because the ranges of some parameters are discrete, an exact target Recall@1 cannot always be achieved; the target Recall@1 was therefore chosen as a value that all methods can approximately attain.

The disk consumption of the index data structure is also reported. This was measured by storing the data structure on disk and checking its size in bytes. Note that the runtime (peak) memory consumption is the more important measure, but measuring peak memory usage is not always stable and can vary across computers. Thus, we report the disk space instead, which is reproducible and strongly correlated with the memory consumption. The runtime of building the data structure is also reported.

Dataset | Method | Recall@1 (fixed) | Runtime/query | Disk space | Build time
SIFT1M | Annoy (Bernhardsson, 2018) | 0.67 | 0.18 ms | 1703 MB | 899 sec
SIFT1M | FALCONN (Razenshteyn and Schmidt, 2018; Andoni et al., 2015) | 0.63 | 0.87 ms | - | 1.8 sec
SIFT1M | NMSLIB (HNSW) (Naidan et al., 2018; Boytsov and Naidan, 2013; Malkov and Yashunin, 2016) | 0.67 | 0.043 ms | 669 MB | 436 sec
SIFT1M | Faiss (IVFADC) (Jégou et al., 2018, 2011a) | 0.67 | 0.61 ms | 73 MB | 30 sec
SIFT1M | Rii (proposed) | 0.64 | 0.73 ms | 69 MB | 82 sec
SIFT1M | Rii-OPQ (proposed) | 0.65 | 0.82 ms | 69 MB | 85 sec
GIST1M | Annoy (Bernhardsson, 2018) | 0.49 | 1.2 ms | 5023 MB | 2088 sec
GIST1M | FALCONN (Razenshteyn and Schmidt, 2018; Andoni et al., 2015) | 0.53 | 8.6 ms | - | 7.2 sec
GIST1M | NMSLIB (HNSW) (Naidan et al., 2018; Boytsov and Naidan, 2013; Malkov and Yashunin, 2016) | 0.49 | 0.19 ms | 3997 MB | 1576 sec
GIST1M | Faiss (IVFADC) (Jégou et al., 2018, 2011a) | 0.52 | 3.8 ms | 253 MB | 51 sec
GIST1M | Rii (proposed) | 0.45 | 3.2 ms | 246 MB | 353 sec
GIST1M | Rii-OPQ (proposed) | 0.50 | 3.8 ms | 249 MB | 388 sec
Table 2. Comparison with existing methods on SIFT1M/GIST1M. Unless explicitly denoted, we adopt the default parameters for each method.

Table 2 presents the results. We summarize our findings:

  • Rii was comparable with the state-of-the-art system Faiss. In particular, although our method is essentially an approximation of IVFADC, the decrease in accuracy is not significant.

  • Rii was the most memory-efficient among the methods. The measured values are almost the same as the theoretically predicted values (68 MB against 69 MB, and 244 MB against 249 MB).

  • Comparing Rii and Rii-OPQ, Rii-OPQ was slightly slower but a little more accurate under the same parameter settings.

  • Annoy achieved the second-fastest result. Because Annoy supports a direct memory-map design, its construction required some time and consumed a relatively large disk space.

  • FALCONN achieved a comparable (or slightly slower) performance to Faiss/Rii. We note that the building cost of FALCONN is considerably smaller than that of the other methods. As FALCONN does not provide IO functions, we do not report its disk space.

  • As reported in the benchmarks (Aumüller et al., 2017; Bernhardsson et al., 2018), NMSLIB achieved the fastest performance. On the other hand, its building time and memory consumption are inferior to those of Faiss/Rii.

  • The results for SIFT1M and GIST1M follow similar tendencies.

6. Application

We present an application that highlights the subset-search function of Rii. For this demonstration, we leverage the data of The Metropolitan Museum of Art (MET) Open Access (https://github.com/metmuseum/openaccess). This dataset contains more than 420,000 items from the MET, with both an image and extensive metadata for each item (Table 3). From these data, we selected the 201,998 items that are provided under the Creative Commons license. For each image, we extracted the 1,920-dimensional activations of the last average pooling layer of a DenseNet-201 (Huang et al., 2017) trained on ImageNet. The features are stored in Rii. The metadata are stored in a table using Pandas (https://pandas.pydata.org/), a popular in-memory data management library for Python.

Fig. 7 demonstrates the system, including the Python code and the search results. The metadata and DenseNet vectors are first read. Then, a search is conducted on the metadata; here, the items created before A.D. 500 in Egypt are specified. Next, the target identifiers are prepared, which is simply the set of IDs of the selected items. The image-based search is then conducted over them. The query here is a Chinese tapestry. We can thus find items similar to the Chinese tapestry among the museum items from ancient Egypt.

As this demonstration reveals, searching over a target subset is a general problem setting, and Rii can solve this type of problem easily. As Sec. 5.3 shows, existing methods with the post-checking module do not perform well when $|\mathcal{S}|$ is small. In this application, for example, the result of the metadata search can have any number of items; Rii can handle a subset search for any size of $\mathcal{S}$.

ID | title | date | country
0 | Bust of Abraham Lincoln | 1876 | United States
1 | Acorn Clock | 1847 | United States
Table 3. Metadata of the MET dataset. Each item has several attributes, such as title and date.
import numpy as np
import pandas as pd
import pickle as pkl
import rii

# Read data
df = pd.read_csv('metadata.csv')
engine = pkl.load(open('rii_densenet.pkl', 'rb'))

# Metadata search (13.5 ms)
S = df[(df['date'] < 500) & (df['country'] == 'Egypt')]['ID']
S = np.sort(np.array(S))  # Target identifiers

# ANN for subset (2 ms)
q = ...  # Read the query feature (a 1,920-dim DenseNet vector)
result = engine.query(q=q, target_ids=S, topk=3)

Figure 7. Demonstration of the subset search (panels: the query, the nearest, the 2nd nearest, and the 3rd nearest items). The target items are first selected using the metadata information. Then, an image-based search is conducted over the target items.

7. Conclusions

We developed an approximate nearest neighbor search method, called Rii. Rii provides two functions: searching over a subset, and a reconfigure function for newly added vectors. Extensive comparisons showed that Rii achieves a performance comparable to state-of-the-art systems, such as Faiss.

Note that the latest systems incorporate HNSW for the coarse assignment of IVFADC (Baranchuk et al., 2018; Douze et al., 2018). Our Rii architecture can be combined with them, but this remains future work.

Acknowledgments: This work was supported by JST ACT-I Grant Number JPMJPR16UO, Japan.

References

  • Andoni et al. (2015) Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. 2015. Practical and Optimal LSH for Angular Distance. In Proc. NIPS.
  • André et al. (2015) Fabien André, Anne-Marie Kermarrec, and Nicolas Le Scouarnec. 2015. Cache Locality is Not Enough: High-performance Nearest Neighbor Search with Product Quantization Fast Scan. In Proc. VLDB.
  • André et al. (2017) Fabien André, Anne-Marie Kermarrec, and Nicolas Le Scouarnec. 2017. Accelerated Nearest Neighbor Search with Quick ADC. In Proc. ICMR.
  • Aumüller et al. (2017) Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. 2017. ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. In Proc. SISAP.
  • Babenko and Lemitsky (2017) Artem Babenko and Victor Lemitsky. 2017. AnnArbor: Approximate Nearest Neighbors Using Arborescence Coding. In Proc. IEEE ICCV.
  • Babenko and Lempitsky (2014) Artem Babenko and Victor Lempitsky. 2014. Additive Quantization for Extreme Vector Compression. In Proc. IEEE CVPR.
  • Babenko and Lempitsky (2015a) Artem Babenko and Victor Lempitsky. 2015a. The Inverted Multi-Index. IEEE TPAMI 37, 6 (2015), 1247–1260.
  • Babenko and Lempitsky (2015b) Artem Babenko and Victor Lempitsky. 2015b. Tree Quantization for Large-Scale Similarity Search and Classification. In Proc. IEEE CVPR.
  • Babenko and Lempitsky (2016) Artem Babenko and Victor Lempitsky. 2016. Efficient Indexing of Billion-Scale Datasets of Deep Descriptors. In Proc. IEEE CVPR.
  • Baranchuk et al. (2018) Dmitry Baranchuk, Artem Babenko, and Yury Malkov. 2018. Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors. In Proc. ECCV.
  • Bernhardsson (2018) Erik Bernhardsson. 2018. Annoy. https://github.com/spotify/annoy.
  • Bernhardsson et al. (2018) Erik Bernhardsson, Martin Aumüller, and Alexander Faithfull. 2018. ann-benchmarks. https://github.com/erikbern/ann-benchmarks.
  • Blalock and Guttag (2017) Davis W. Blalock and John V. Guttag. 2017. Bolt: Accelerated Data Mining with Fast Vector Compression. In Proc. ACM KDD.
  • Boytsov and Naidan (2013) Leonid Boytsov and Bilegsaikhan Naidan. 2013. Engineering Efficient and Effective Non-metric Space Library. In Proc. SISAP.
  • Datar et al. (2004) Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. SCG.
  • Douze et al. (2016) Matthijs Douze, Hervé Jégou, and Florent Perronnin. 2016. Polysemous Codes. In Proc. ECCV.
  • Douze et al. (2018) Matthijs Douze, Alexandre Sablayrolles, and Hervé Jégou. 2018. Link and code: Fast indexing with graphs and compact regression codes. In Proc. IEEE CVPR.
  • Ge et al. (2014) Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2014. Optimized Product Quantization. IEEE TPAMI 36, 4 (2014), 744–755.
  • Gudmundsson et al. (2018) Gylfi Þór Gudmundsson, Björn Þór Jónsson, Laurent Amsaleg, and Michael J. Franklin. 2018. Prototyping a Web-Scale Multimedia Retrieval Service Using Spark. ACM TOMM 14, 3s (2018), 65:1–65:24.
  • Heo et al. (2016) Jae-Pil Heo, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Sung-Eui Yoon. 2016. Shortlist Selection With Residual-Aware Distance Estimator for K-Nearest Neighbor Search. In Proc. IEEE CVPR.
  • Heo et al. (2014) Jae-Pil Heo, Zhe Lin, and Sung-Eui Yoon. 2014. Distance Encoded Product Quantization. In Proc. IEEE CVPR.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proc. IEEE CVPR.
  • Iwamura et al. (2013) Masakazu Iwamura, Tomokazu Sato, and Koichi Kise. 2013. What Is the Most Efficient Way to Select Nearest Neighbor Candidates for Fast Approximate Nearest Neighbor Search?. In Proc. IEEE ICCV.
  • Jain et al. (2016) Himalaya Jain, Patrick Pérez, Rémi Gribonval, Joaquin Zepeda, and Hervé Jégou. 2016. Approximate Search with Quantized Sparse Representations. In Proc. ECCV.
  • Jégou et al. (2018) Hervé Jégou, Matthijs Douze, and Jeff Johnson. 2018. Faiss. https://github.com/facebookresearch/faiss.
  • Jégou et al. (2011a) Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011a. Product Quantization for Nearest Neighbor Search. IEEE TPAMI 33, 1 (2011), 117–128.
  • Jégou et al. (2011b) Hervé Jégou, Romain Tavenard, Matthijs Douze, and Laurent Amsaleg. 2011b. Searching in One Billion Vectors: Re-rank with Source Coding. In Proc. IEEE ICASSP.
  • Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale Similarity Search with GPUs. CoRR abs/1702.08734 (2017).
  • Kalantidis and Avrithis (2014) Yannis Kalantidis and Yannis Avrithis. 2014. Locally Optimized Product Quantization for Approximate Nearest Neighbor Search. In Proc. IEEE CVPR.
  • Liu et al. (2017) Yingfan Liu, Hong Cheng, and Jiangtao Cui. 2017. PQBF: I/O-Efficient Approximate Nearest Neighbor Search by Product Quantization. In Proc. CIKM.
  • Lv et al. (2007) Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. 2007. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proc. VLDB.
  • Malkov et al. (2014) Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. 2014. Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs. Inf. Syst. 45 (2014), 61–68.
  • Malkov and Yashunin (2016) Yury A. Malkov and Dmitry A. Yashunin. 2016. Efficient and Robust Approximate Nearest Neighbor Search using Hierarchical Navigable Small World Graphs. CoRR abs/1603.09320 (2016).
  • Martinez et al. (2016) Julieta Martinez, Joris Clement, Holger H. Hoos, and James J. Little. 2016. Revisiting Additive Quantization. In Proc. ECCV.
  • Matsui et al. (2017) Yusuke Matsui, Keisuke Ogaki, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2017. PQk-means: Billion-scale Clustering for Product-quantized Codes. In Proc. MM.
  • Matsui et al. (2018a) Yusuke Matsui, Yusuke Uchida, Hervé Jégou, and Shin’ichi Satoh. 2018a. A Survey of Product Quantization. ITE Transactions on Media Technology and Applications 6, 1 (2018), 2–10.
  • Matsui et al. (2018b) Yusuke Matsui, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2018b. PQTable: Non-exhaustive Fast Search for Product-quantized Codes using Hash Tables. IEEE TMM 20, 7 (2018), 1809–1822.
  • Muja and Lowe (2014) Marius Muja and David G. Lowe. 2014. Scalable Nearest Neighbor Algorithms for High Dimensional Data. IEEE TPAMI 36, 11 (2014), 2227–2240.
  • Naidan et al. (2018) Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov, David Novak, and Ben Frederickson. 2018. Non-Metric Space Library (NMSLIB). https://github.com/searchivarius/nmslib.
  • Norouzi and Fleet (2013) Mohammad Norouzi and David J. Fleet. 2013. Cartesian k-means. In Proc. IEEE CVPR.
  • Razenshteyn and Schmidt (2018) Ilya Razenshteyn and Ludwig Schmidt. 2018. FALCONN - FAst Lookups of Cosine and Other Nearest Neighbors. https://github.com/FALCONN-LIB/FALCONN.
  • Spyromitros-Xioufis et al. (2014) Eleftherios Spyromitros-Xioufis, Symeon Papadopoulos, Ioannis (Yiannis) Kompatsiaris, Grigorios Tsoumakas, and Ioannis Vlahavas. 2014. A Comprehensive Study Over VLAD and Product Quantization in Large-Scale Image Retrieval. IEEE TMM 16, 6 (2014), 1713–1728.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper With Convolutions. In Proc. IEEE CVPR.
  • Wang et al. (2015) Jianfeng Wang, Jingdong Wang, Jingkuan Song, Xin-Shun Xu, Heng Tao Shen, and Shipeng Li. 2015. Optimized Cartesian K-Means. IEEE TKDE 27, 1 (2015), 180–192.
  • Wieschollek et al. (2016) Patrick Wieschollek, Oliver Wang, Alexander Sorkine-Hornung, and Hendrik P. A. Lensch. 2016. Efficient Large-Scale Approximate Nearest Neighbor Search on the GPU. In Proc. IEEE CVPR.
  • Xia et al. (2013) Yan Xia, Kaiming He, Fang Wen, and Jian Sun. 2013. Joint Inverted Indexing. In Proc. IEEE ICCV.
  • Zhang et al. (2018) Jialiang Zhang, Soroosh Khoram, and Jing Li. 2018. Efficient Large-Scale Approximate Nearest Neighbor Search on OpenCL FPGA. In Proc. IEEE CVPR.
  • Zhang et al. (2014) Ting Zhang, Chao Du, and Jingdong Wang. 2014. Composite Quantization for Approximate Nearest Neighbor Search. In Proc. ICML.
  • Zhang et al. (2015) Ting Zhang, Guo-Jun Qi, Jinhui Tang, and Jingdong Wang. 2015. Sparse Composite Quantization. In Proc. IEEE CVPR.