A neural network catalyzer for multidimensional similarity search
Abstract
This paper aims at learning a function mapping input vectors to an output space in a way that improves high-dimensional similarity search.
As a proxy objective, we design and train a neural network that favors uniformity in the spherical output space, while preserving the neighborhood structure after the mapping. For this purpose, we propose a new regularizer derived from the Kozachenko–Leonenko differential entropy estimator and combine it with a locality-aware triplet loss.
Our method operates as a catalyzer for traditional indexing methods such as locality sensitive hashing or iterative quantization, boosting the overall recall.
Additionally, the network output distribution makes it possible to leverage structured quantizers with efficient algebraic encoding, in particular spherical lattice quantizers such as the Gosset lattice $E_8$.
Our experiments show that this approach is competitive with state-of-the-art methods such as optimized product quantization.
1 Introduction
Recent work kraska17learnedindex () proposed to leverage the pattern-matching ability of machine learning algorithms to improve traditional index structures such as B-trees or Bloom filters, with encouraging results. In their one-dimensional case, an optimal B-tree can be constructed if the cumulative density function (CDF) is known, and thus Kraska et al. kraska17learnedindex () learn the CDF using a neural network. We emphasize that the CDF itself is a mapping between an arbitrary input distribution and a uniform distribution in $[0, 1]$. In this work, we wish to generalize such an approach to multidimensional spaces: we aim at learning a function that improves or simplifies the subsequent high-dimensional indexing method.
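To make the role of the CDF concrete, the following sketch pushes skewed one-dimensional samples through their empirical CDF and checks that the result is spread evenly over $[0, 1]$. It is an illustration only, not the learned model of Kraska et al.; the exponential input distribution, the sample size and the helper name `empirical_cdf` are assumptions chosen for the example.

```python
import bisect
import random


def empirical_cdf(samples):
    """Build u = F(x) from training samples (hypothetical helper)."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        # Fraction of training samples that are <= x.
        return bisect.bisect_right(ordered, x) / n

    return cdf


random.seed(0)
train = [random.expovariate(1.0) for _ in range(10000)]  # skewed input
cdf = empirical_cdf(train)
mapped = [cdf(x) for x in train]

# Histogram of the mapped values over 10 equal-width bins of [0, 1].
counts = [0] * 10
for u in mapped:
    counts[min(int(u * 10), 9)] += 1
print(counts)  # every bin holds roughly 1000 of the 10000 samples
```

In one dimension this mapping is enough to make equal-width buckets equally populated, which is the property the multidimensional mapping tries to approximate on the sphere.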
Similarity search methods often rely on various forms of learning machinery jegou11pq (); gong2013iterative (); wang2014hashing (); douze2016polysemous (); in particular, there is a substantial body of literature on methods producing compact codes. Another line of work shows the interest of neural networks in the context of binary hashing liong2015deep (); jain17subic (). Yet the problem of jointly optimizing a coding stage and a neural network remains essentially unsolved, partly because it is difficult to optimize through a discretization function. For this reason, most efforts have been devoted to networks producing binary codes, for which optimization tricks exist, such as soft binarization or stochastic relaxation. However, it is difficult to improve over more powerful codes such as those produced by product quantization jegou11pq (), and recent solutions addressing product quantization require complex optimization procedures klein2017defense ().
In order to circumvent this problem, we propose a drastic simplification of learning algorithms for indexing. Instead of trying to optimize through a discretization layer, we learn a mapping such that the output follows the distribution under which the subsequent discretization method, either binary or a more general quantizer, is optimal. In other terms, instead of trying to adapt an indexing structure to the data, we adapt the data to the index. As a side note, many similarity search methods are implicitly designed for the range search problem (or near neighbor, as opposed to nearest neighbor indyk1998approximate (); andoni2006near ()), which aims at finding all vectors whose distance to the query vector is below a fixed threshold. For real-world high-dimensional distributions, range search typically returns either no neighbors or too many. The discrepancy between near-neighbor and nearest-neighbor search is much lower with uniform data.
Our proposal requires jointly optimizing two antithetical criteria. First, we need to ensure that neighbors are preserved by the mapping, using a vanilla ranking loss usunier2009ranking (); chechik2010large (); wang2014learning (). Second, the training must favor a uniform output. This suggests a regularization similar to maximum entropy pereyra2017regularizing (), except that in our case we consider a continuous output space. We therefore propose to cast an existing differential entropy estimator into a regularization term, which plays the same “distribution-matching” role as the Kullback-Leibler term of variational autoencoders doersch2016tutorial ().
Our approach is illustrated by Figure 1. We summarize our contributions as follows:

We introduce an approach for multidimensional indexing that maps the input data to an output space in which indexing is easier. It learns a neural network that plays the role of an adapter for subsequent similarity search methods.

For this purpose we introduce a loss derived from the Kozachenko–Leonenko differential entropy estimator to favor uniformity in the spherical output space.

The end-to-end performance of existing techniques like Locality Sensitive Hashing (LSH) with short codes charikar02lsh () or Iterative Quantization gong2013iterative () is improved significantly by our mapping.

Our learned mapping makes it possible to leverage, in particular for nonuniform data, spherical lattice quantizers with competitive coding properties and efficient algebraic encoding.
This paper is organized as follows. Section 2 discusses related work. Our neural model and the optimization scheme are introduced in Section 3. Section 4 details how we combine this strategy with lattice assignment to produce compact codes. Section 5 evaluates our approach experimentally.

2 Related work
Generative modeling. Models such as Generative Adversarial Networks (GANs) goodfellow14gan () or Variational Auto-Encoders (VAEs) kingma13vae () aim at learning a mapping between an isotropic Gaussian distribution and the empirical distribution of a training set. Our approach maps an empirical input distribution to a uniform distribution on the spherical output space. Another distinction is that GANs learn a unidirectional mapping from the latent code to an image (decoder), whereas VAEs learn a bidirectional mapping (encoder-decoder). In our work, we focus on learning the encoder, which is the neural network preprocessing of input vectors for subsequent indexing.
Dimensionality reduction and representation learning. There is a large body of literature on the topic of dimensionality reduction, see for instance the review by van der Maaten et al. van2009dimensionality (). Relevant work includes the stochastic neighbor embedding hinton03sne () and the subsequent t-SNE approach vandermaaten08tsne (), which is tailored to low-dimensional spaces for visualisation purposes. Both are nonlinear dimensionality reduction methods that aim at preserving the neighborhood in the output space.
Learning to index and quantize. The literature on compact codes for indexing is most relevant to our work, see wang2014hashing (); wang2016learning () for an overview of the topic. Early popular high-dimensional approximate neighbor methods, such as Locality Sensitive Hashing indyk1998approximate (); GIM99 (); charikar02lsh (); andoni2006near (), mostly relied on statistical guarantees without any learning stage. This lack of data adaptation was subsequently addressed by several works. Iterative Quantization (ITQ) gong2013iterative () modifies the coordinate system to improve the binarization, while methods inspired by compression jegou11pq (); babenko2014additive (); ZQTW15 (); jain2016quantizedsparse () have gradually emerged as strong competitors for estimating distances or similarities with compact codes. While most of these works aim at reproducing a target (dis)similarity, some recent works directly leverage semantic information in a supervised manner with neural networks liong2015deep (); jain17subic (); klein2017defense (); sablayrolles2017should ().
Lattices, also known as Euclidean networks, are discrete subsets of the Euclidean space that are of particular interest due to their space covering and sphere packing properties conway2013sphere (). They also have excellent discretization properties under some assumptions about the distribution, and most interestingly the closest point of a lattice is determined efficiently thanks to algebraic properties ran1998efficient (). This is why lattices have been proposed jegou2008query () as hash functions in LSH. However, for real-world data, lattices waste capacity because they assume that all regions of the space have the same density pauleve2010locality (). In this paper, we are interested in spherical lattices because of their bounded support.
Entropy regularization appears in many areas of machine learning and indexing. For instance, Pereyra et al. pereyra2017regularizing () argue that penalizing confident output distributions is an effective regularization. Another proposal by Bojanowski et al. bojanowski2017unsupervised (), in an unsupervised learning context, is to spread the output by enforcing input images to map to points drawn uniformly on a sphere. Interestingly, most recent works on binary hashing introduce some form of entropic regularization: deep hashing liong2015deep (), for instance, employs a regularization term that increases the marginal entropy of each bit. SUBIC jain17subic () extends this idea to one-hot codes.
3 Our approach: Learning the catalyzer
As discussed in the introduction, our proposal is inspired by prior work for one-dimensional indexing kraska17learnedindex (). However, their approach, based on unidimensional density estimation, cannot be directly translated to the multidimensional case. Our strategy is to train a neural network $f$ that maps vectors from a $d_{in}$-dimensional space to a $d_{out}$-dimensional space. We perform the training on $n$ input points $x_1, \dots, x_n$. We first simplify the problem by constraining the output representation to lie on the unit hypersphere via $\ell_2$-normalization. On this hypersphere we can define a bounded, uniform and isotropic distribution. The trunk of the network itself is a simple Multi-Layer Perceptron (MLP) rosenblatt58perceptron () with 2 hidden layers, and uses rectified linear units as a nonlinearity.
3.1 KoLeo: Differential entropy regularizer
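Let us first introduce our regularizer, which we design to spread out points uniformly across the output hypersphere. With the knowledge of the density of points $p$, we could directly maximize the differential entropy $-\int p(u) \log(p(u))\, du$. Given only samples $(f(x_1), \dots, f(x_n))$, where $f$ denotes the network, we instead use an estimator of the differential entropy as a proxy. It was shown by Kozachenko and Leonenko (see e.g. beirlant97entropy ()) that defining $\rho_{n,i} = \min_{j \neq i} \|f(x_i) - f(x_j)\|$, the differential entropy of the distribution can be estimated by
$$H_n = \frac{\alpha_n}{n} \sum_{i=1}^{n} \log(\rho_{n,i}) + \beta_n, \qquad (1)$$
where $\alpha_n$ and $\beta_n$ are two constants that depend on the number of samples $n$ and the dimensionality of the data $d_{out}$. Ignoring the affine components, we define our entropic regularizer as
$$\mathcal{L}_{KoLeo} = -\frac{1}{n} \sum_{i=1}^{n} \log(\rho_{n,i}). \qquad (2)$$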
This loss also has a satisfactory geometric interpretation: closest points are pushed away, with a nonlinearity that is nondecreasing and concave, which ensures diminishing returns: as points get away from each other, the marginal impact of increasing the distance becomes smaller.
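As a minimal sketch of this regularizer, the code below computes the negative mean log nearest-neighbor distance over a small batch of output points, using a brute-force neighbor search. The function name `koleo_loss` and the two toy point sets on the unit circle are assumptions for illustration; a training implementation would operate on batches of network outputs and guard against duplicate points (zero distance).

```python
import math


def koleo_loss(outputs):
    """-(1/n) * sum_i log(min_{j != i} ||p_i - p_j||) over a batch of points."""
    n = len(outputs)
    total = 0.0
    for i, p in enumerate(outputs):
        # Distance from p to its nearest neighbor within the batch.
        rho = min(math.dist(p, q) for j, q in enumerate(outputs) if j != i)
        total += math.log(rho)
    return -total / n


def on_circle(angles):
    return [(math.cos(t), math.sin(t)) for t in angles]


spread = on_circle([0.0, 0.5 * math.pi, math.pi, 1.5 * math.pi])
clustered = on_circle([0.0, 0.1, 0.2, 0.3])

# The evenly spread configuration receives the lower (better) penalty.
print(koleo_loss(spread), koleo_loss(clustered))
```

The clustered points are penalized much more heavily than the evenly spread ones, and because of the logarithm the penalty decreases more and more slowly as points separate.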
Remarks.
We are essentially seeking the same effect as Pereyra et al. pereyra2017regularizing (), i.e., that the penalty terms tend to make the output more uniform. Note however that in our case, we have a continuous output and therefore a discrete entropy term is not appropriate. Citing Doersch doersch2016tutorial (), “it is interesting to view the Kullback-Leibler as a regularization term”, one could also draw a relationship with the Kullback-Leibler (KL) terms occurring in VAEs and t-SNE, in that the formula of the KL becomes very similar to Eqn. 2 when considering a uniform distribution. A key difference is the quantity , which is computed differently and does not need to have a probabilistic interpretation in our case.
3.2 Rank preserving loss
We enforce the outputs of the neural network $f$ to follow the same neighborhood structure as in the input space by adopting the margin triplet loss chechik2010large (); wang2014learning ()
$$\mathcal{L}_{rank} = \max\left(0,\ \|f(x) - f(x^+)\|_2 - \|f(x) - f(x^-)\|_2 + m\right), \qquad (3)$$
where $x$ is a query, $x^+$ a positive match, $x^-$ a negative match and $m$ is the desired margin. The positive matches are obtained by computing the $k_{pos}$ nearest neighbors of each point $x$ in the training set in the input space. The negative matches are generated by taking the $k_{neg}$-th nearest neighbor of $f(x)$ in $(f(x_1), \dots, f(x_n))$. In order to speed up the learning, we compute the $k_{neg}$-th nearest neighbor of every point in the dataset at the end of each epoch, and use these as negatives for the next epoch. Our overall loss combines the triplet loss and the entropy regularizer $\mathcal{L}_{KoLeo}$ of Eqn. 2, as
$$\mathcal{L} = \mathcal{L}_{rank} + \lambda \mathcal{L}_{KoLeo}, \qquad (4)$$
where the parameter $\lambda$ controls the trade-off between ranking quality and uniformity.
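The combination of the two terms can be sketched as follows. This is a minimal illustration under stated assumptions: the triplet term uses plain (non-squared) Euclidean distances with an additive margin, the ranking term is averaged over the triplets, and the names `triplet_loss`, `total_loss`, `lam` and `margin` are hypothetical. The exact form used for training may differ.

```python
import math


def triplet_loss(fx, fpos, fneg, margin):
    """Hinge on d(query, positive) - d(query, negative) + margin."""
    return max(0.0, math.dist(fx, fpos) - math.dist(fx, fneg) + margin)


def koleo_loss(outputs):
    """Negative mean log distance to the nearest neighbor in the batch."""
    total = 0.0
    for i, p in enumerate(outputs):
        total += math.log(min(math.dist(p, q)
                              for j, q in enumerate(outputs) if j != i))
    return -total / len(outputs)


def total_loss(triplets, outputs, lam, margin):
    """Mean triplet loss plus lam times the entropy regularizer."""
    rank = sum(triplet_loss(q, p, n, margin) for q, p, n in triplets) / len(triplets)
    return rank + lam * koleo_loss(outputs)


outputs = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
triplets = [((1.0, 0.0), (0.0, 1.0), (-1.0, 0.0))]  # (query, positive, negative)
print(total_loss(triplets, outputs, lam=0.0, margin=0.0))  # ranking term only
print(total_loss(triplets, outputs, lam=1.0, margin=0.0))  # with regularizer
```

Setting `lam` to zero recovers the pure ranking objective, while large values let the uniformity term dominate, which is the trade-off discussed in Section 3.3.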
3.3 Discussion
Choice of the trade-off parameter.
Figure 1 was produced by our method on a toy dataset adapted to the disk as the output space. Without the regularization term, neighboring points tend to collapse and most of the output space is not exploited. If we quantize this output with a regular quantizer, many Voronoi cells are empty, meaning that we waste coding capacity. In contrast, if we solely rely on the entropic regularizer, the neighbors are poorly preserved. Interesting trade-offs are achieved when setting the parameter by a cross-validation that includes the subsequent quantization stage.
Nearest neighbor vs range search.
Figure 2 shows how our method achieves a better agreement between range search and k-nearest neighbors search on real data. In this experiment, we consider different thresholds for the range search and perform a set of queries for each threshold. Then we measure how many vectors we must return, on average, to achieve a certain recall in terms of the nearest neighbors in the original space. Without our mapping, there is a large variance on the number of results for a given threshold. In contrast, after the mapping it is possible to use a unique threshold to find most neighbors.
Visualization of the output distribution.
While Figure 1 illustrates our method with the 2D disk as an output space, we are interested in mapping input samples to a higher dimensional hypersphere. Figure 3 proposes a visualization of the high-dimensional density from a different viewpoint, with the Deep1M dataset mapped in dimensions. We sample 2D planes randomly in the output space and project the dataset points on them. For each row, the 5 figures are the angular histograms of the points with a polar parametrization of this plane. The area inside the curve is constant and proportional to the number of samples. A uniform angular distribution produces a centered disk, and less uniform distributions look like unbalanced potatoes.
The densities we represent are marginalized, so if the distribution looks non-uniform then it is non-uniform in the full output space, but the converse is not true. Yet one can compare the results obtained for different regularization coefficients, which shows that our regularizer has a strong uniformizing effect on the mapping, ultimately resembling that of a uniform distribution for .
4 Compact codes with lattices
In this section we describe how we leverage lattices with our catalyzer. Our motivation is as follows: as discussed by Paulevé et al. pauleve2010locality (), lattices impose a rigid partitioning of the feature space, which is suboptimal for arbitrary distributions, see Figure 1. In contrast, lattices offer excellent quantization properties for a uniform distribution conway2013sphere (). Thanks to our regularizer, we are likely to better meet the condition of optimality in the output space, thereby making lattices an attractive choice.
Note that prior works employing lattices for indexing andoni2006near (); pauleve2010locality () considered a framework different from ours, inspired by the E2LSH variant of LSH DIIM04 () and often referred to as the “cell-probe model”: lattices are used to select regions likely to contain neighbors, and a subsequent stage compares the query vectors with all the vectors assigned to the same lattice point. In contrast, we consider the problem of producing compact codes that can be used to compare compressed vectors charikar02lsh (); jegou11pq (). The relevant trade-off is the retrieval accuracy versus the size of the code in bits.
4.1 Spherical lattice
From now on, we focus on spherical lattices, which are regular subsets of points on a hypersphere centered at the origin. One practical way conway2013sphere () to construct a good spherical quantizer is to intersect a top-performing Euclidean lattice with such a hypersphere. In this paper, we focus on the 8-dimensional Gosset lattice $E_8$. We denote by $S_r$ the 8-dimensional hypersphere of radius $r$.
The first hypersphere intersecting $E_8$ is $S_{\sqrt{2}}$. The intersection is a regular subset of 240 points, which are projected back to the hypersphere by normalization. We obtain a larger number of points on the hypersphere by increasing the radius. The hyperspheres having a non-empty intersection with $E_8$ are those such that $r^2$ is an even integer. They offer many advantages:

They inherit the remarkable compactness and quantization qualities of $E_8$.

The assignment is performed efficiently, see supplementary material for details.

The lattice points are enumerated and constructed without storing them in a table. This is especially important for large radii, for which the size of $S_r \cap E_8$ can be in the millions.
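The count of 240 points on the first shell can be checked by brute force. The sketch below is an illustration, not the enumeration procedure used in the experiments: it works in doubled coordinates (every coordinate multiplied by 2) so that half-integer points stay integral, and relies on the standard description of the Gosset lattice as the integer or half-integer points whose coordinates sum to an even number. The function name `e8_shell` is hypothetical.

```python
import itertools
import math


def e8_shell(r2):
    """Points of the Gosset lattice with squared norm r2, as doubled coordinates."""
    target = 4 * r2                      # squared norm of the doubled vector
    bound = math.isqrt(target)
    points = []
    for parity in (0, 1):                # 0: integer points, 1: half-integer points
        values = [c for c in range(-bound, bound + 1) if c % 2 == parity]
        for v in itertools.product(values, repeat=8):
            # Even coordinate sum <=> doubled coordinates sum to a multiple of 4.
            if sum(c * c for c in v) == target and sum(v) % 4 == 0:
                points.append(v)
    return points


first_shell = e8_shell(2)
print(len(first_shell))  # 240, the count quoted in the text
```

Exhaustive enumeration of this kind is only practical for the smallest radii, which is why an algebraic construction that avoids storing the points matters for larger shells.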
4.2 Product space output
As the radius grows, our quantization procedure becomes more accurate, but the performance of our indexing system is eventually limited by the output dimensionality of our neural network: the 8-dimensional space may be too small to preserve the neighborhood, as we observed in our experiments.
We thus propose to get more flexibility by considering the Cartesian product of lattices.
4.3 Compact codes and comparison procedure
In terms of coding, the size of the code is bits. In practice we only consider (note that ). In order to assign a vector in , we first compute , and find the nearest vector on using the fast assignment operation, which formally minimizes:
(5) 
Given a query and its representation , we approximate the similarity between and using the code: (asymmetric comparison jegou11pq ()).
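The asymmetric comparison can be illustrated with a toy quantizer. In the sketch below, database vectors are stored only as the index of their nearest codeword, while the query is kept in full precision and compared with the decoded codeword through a dot product. The four-point codebook on the unit circle, the function names and the sample vectors are assumptions made for the example; they stand in for the spherical lattice and its fast assignment.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def encode(y, codebook):
    """Index of the codeword with the largest dot product with y."""
    return max(range(len(codebook)), key=lambda k: dot(codebook[k], y))


def asymmetric_similarity(query, code, codebook):
    """Similarity between an uncompressed query and a compressed vector."""
    return dot(query, codebook[code])


# Toy codebook: four unit vectors standing in for the lattice points.
codebook = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
database = [(0.9, 0.1), (-0.2, 0.95), (0.1, -0.99)]
codes = [encode(y, codebook) for y in database]      # stored compact codes

query = (0.8, 0.6)
scores = [asymmetric_similarity(query, c, codebook) for c in codes]
ranking = sorted(range(len(codes)), key=lambda i: -scores[i])
print(codes, scores, ranking)
```

Only the database side is quantized, so the approximation error comes from a single quantization step rather than two.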
5 Experiments
This section presents our experimental results. We focus on the class of similarity search methods that represent the database vectors with a compressed representation charikar02lsh (); jegou11pq (); gong2013iterative (); GHKS13 (), which makes it possible to store very large datasets in memory LCL04 (); torralba2008small ().
5.1 Experimental setup
Datasets and metrics.
We carry out our experiments on two public datasets, namely Deep1M and BigAnn1M. Deep1M consists of the first million vectors of the Deep1B dataset babenko2016efficient (). The vectors were obtained by running a convnet on an image collection, reduced to 96 dimensions by principal component analysis and subsequently normalized. We also experiment with the BigAnn1M jegou2011searching (), which consists of SIFT descriptors Lowe04sift (). This dataset is used in many prior works.
Both datasets contain 1M vectors that serve as a reference set, 10k query vectors and a very large training set of which we use k elements for training, and 1M vectors that we use as a base to cross-validate the hyperparameters.
We also perform one experiment on the full Deep1B and BigAnn datasets, which contain 1 billion elements. We evaluate methods with the recall at 10 performance measure, which is the proportion of queries for which the ground-truth nearest neighbor is among the top 10 returned candidates.
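The recall measure can be written in a few lines. The sketch below assumes that each query comes with a ranked list of returned identifiers and a single ground-truth nearest neighbor; the function name `recall_at` and the toy identifiers are hypothetical.

```python
def recall_at(results, ground_truth, r=10):
    """Fraction of queries whose true nearest neighbor is in the top r results."""
    hits = sum(1 for ranked, gt in zip(results, ground_truth) if gt in ranked[:r])
    return hits / len(ground_truth)


# Three queries, each with a ranked list of returned database identifiers.
results = [[3, 1, 2], [5, 4, 9], [7, 8, 6]]
ground_truth = [1, 9, 0]          # true nearest neighbor of each query

print(recall_at(results, ground_truth, r=2))
print(recall_at(results, ground_truth, r=3))
```

With `r=2` only the first query is counted as a hit, and with `r=3` the second one is recovered as well.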
Training.
For all methods, we train our neural network and cross-validate the hyperparameters on the provided training set, and use a different set of vectors for evaluation. In contrast, some works carry out training on the database vectors themselves ML14 (); malkov2016efficient (); gong2013iterative (), in which case the index is tailored to a particular fixed set of database vectors.
5.2 Model architecture and optimization
We adopt a multilayer perceptron to map between our input and output space. Our model consists of hidden layers, each of which comprises units followed by a ReLU nonlinearity. A final linear layer projects the dataset to the desired output dimension , along with normalization. We use batch normalization ioffe15batchnorm () and train our model with Stochastic Gradient Descent with an initial learning rate of and a momentum of . The learning rate is decayed by a factor of if the training loss does not improve for one epoch. We used 100 epochs of optimization for each experiment. On a CPU-only server with cores, the training with 300k samples takes about three hours.
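The forward pass of such a model can be sketched as below. This is an illustration under explicit assumptions: the hidden width (32), the output dimension (8), the random initialization and all function names are placeholders rather than the settings used in the experiments, and batch normalization is omitted for brevity. Only the input dimension of 96 is taken from the Deep1M description.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (placeholder initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def catalyzer_forward(x, layers):
    """Hidden layers with ReLU, final linear layer, then projection to the sphere."""
    h = x
    for weights, bias in layers[:-1]:
        h = relu(linear(h, weights, bias))
    weights, bias = layers[-1]
    return l2_normalize(linear(h, weights, bias))


rng = random.Random(0)
sizes = [96, 32, 32, 8]          # input, two hidden layers, output (assumed)
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
x = [rng.gauss(0.0, 1.0) for _ in range(96)]
y = catalyzer_forward(x, layers)
print(len(y), sum(v * v for v in y))
```

The final normalization guarantees that every output lies on the unit hypersphere regardless of the weights, which is the constraint the regularizer and the spherical quantizers rely on.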
5.3 Binary hashing: Catalyzing existing methods
|                 | Deep1M |      |      |      | BigAnn1M |      |      |      |
| bits per vector | 16     | 32   | 64   | 128  | 16       | 32   | 64   | 128  |
| baseline LSH    | 0.8    | 4.9  | 14.6 | 32.2 | 0.3      | 2.6  | 9.5  | 24.4 |
| catalyst + LSH  | 0.9    | 5.6  | 16.4 | 35.3 | 1.0      | 5.5  | 18.2 | 39.4 |
| baseline ITQ    | 1.0    | 7.3  | 21.0 | n/a  | 1.5      | 8.5  | 22.8 | 41.8 |
| catalyst + ITQ  | 2.0    | 10.4 | 25.2 | 46.5 | 2.3      | 11.0 | 28.3 | 50.6 |
We first show the interest of our method as a catalyzer for two popular binary hashing methods charikar02lsh (); gong2013iterative ():
 LSH

maps Euclidean vectors to binary codes that are then compared with Hamming distance. A set of fixed projection directions is drawn randomly and isotropically in the input space, and each vector is encoded into bits by taking the sign of the dot product with each direction. The theoretical framework of LSH guarantees that with some probability, the Hamming distance reproduces the cosine similarity (up to the monotonic arccos function). However, it is not adapted to the data since the directions are random. Notably, it has trouble differentiating vectors that are in denser areas of the vector distribution and wastes some capacity. Balu et al. balu2014beyond () even showed that there are binary codes that cannot be reached.
 ITQ

is another popular hashing method, which improves on LSH by replacing the random projections with a rotation that is optimized to minimize the quantization error.
Table 1 shows how our transformation improves the performance of LSH and ITQ when applied as a preprocessing step before hashing. In the settings from 32 to 128 bits, the catalyst improves the performance by 0.7 to 15.0 percentage points for LSH and by 2.5 to 8.8 percentage points for ITQ. In particular, ITQ was initially developed and evaluated on range search, and performs poorly for the nearest neighbor search that is evaluated here. However, its performance is significantly boosted by our uniformizing mapping.
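The random-projection LSH baseline described above can be sketched in a few lines. The example draws Gaussian directions (which are isotropic), encodes vectors by the sign of each projection, and recovers the angle between two vectors from the Hamming distance of their codes. The 2-dimensional toy vectors, the 2000-bit code length and the function names are assumptions chosen to keep the example small.

```python
import math
import random


def random_directions(n_bits, dim, rng):
    """Isotropic random directions: i.i.d. Gaussian coordinates."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(x, directions):
    """One bit per direction: the sign of the dot product."""
    return [1 if sum(d * v for d, v in zip(direction, x)) >= 0 else 0
            for direction in directions]


def hamming(code_a, code_b):
    return sum(a != b for a, b in zip(code_a, code_b))


rng = random.Random(0)
directions = random_directions(2000, 2, rng)
x, y = (1.0, 0.0), (0.0, 1.0)           # orthogonal vectors, angle pi / 2

h = hamming(lsh_code(x, directions), lsh_code(y, directions))
estimated_angle = math.pi * h / len(directions)
print(h, estimated_angle)                # the estimate should be close to pi / 2
```

Each bit differs with probability equal to the angle divided by pi, so the normalized Hamming distance estimates the angle; the directions themselves never adapt to the data, which is the limitation the catalyzer addresses.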
5.4 Similarity search with lattice vector quantizers
We now evaluate the lattice-based indexing proposed in Section 4, and compare it to more conventional methods based on quantization, namely PQ jegou11pq () and Optimized Product Quantization (OPQ) GHKS13 (). Figure 4 provides a comparison of all these methods. OPQ is in the usual setting where each subvector is assigned one byte, meaning that each individual quantizer has 256 centroids. We use the Faiss johnson2017billion () implementation of OPQ that does not constrain the quantization space to match that of the input space. For our product lattice, we vary the radius $r$ to increase the quantizer size, hence generating curves for each value of .
On Deep1M, the lattice quantizer easily outperforms PQ for most code sizes. The lattice quantizer also obtains better performance than OPQ.
Largescale experiments.
We experiment with the full Deep1B (resp. BigAnn) dataset, which contains 1 billion vectors. We use 64 bits per code, obtained with an 8x8 OPQ or a 4-head product lattice quantizer with . At that scale, the recall at 10 drops to 26.1% for OPQ and to 30.3% for the lattice quantizer (resp. 21.0% and 21.9%). This means that the precision advantage of the lattice quantizer is maintained at large scale. The encoding of 1 billion vectors takes 131 minutes on a 16-core machine, compared with 33 minutes for OPQ.
Complexity analysis.
The product lattice indexing method proceeds in much the same way as OPQ GHKS13 (). It starts with a transformation stage, which is linear for OPQ, and an MLP in our case. Then we apply a quantization, with k-means for OPQ, and a lattice for our approach. The complexity of OPQ is therefore a natural comparison choice.
In terms of the quantization complexity, OPQ performs 34 kFLOPs per vector, while the forward pass of our method leads to kFLOPs for (the quantization cost is negligible). If the efficiency is important, we can reduce the hidden layers to components, in which case the performance and efficiency (30 kFLOPs) are comparable to OPQ.
6 Concluding remarks
We have proposed to learn a neural network to map to an output distribution that makes high-dimensional indexing more effective. We have demonstrated the interest of our proposal by first considering two popular binary hashing methods, namely LSH and ITQ, which are both significantly improved by our catalyzer.
We also showed a somewhat counterintuitive result: for similarity search with compact codes, it is as good or even better to preprocess the data so that it can use a fixed, rigid lattice quantizer than to quantize it directly with an optimized product quantizer. This approach has several benefits: once the data is mapped, we can use simpler indexing structures, with fewer parameters. Since this result is, to the best of our knowledge, the first of its kind, we hope that subsequent works will propose new architectures for such catalyzers.
Acknowledgement
The authors thank Armand Joulin and Piotr Bojanowski for useful comments and discussions.
Decoding on the intersection of $E_8$ and hyperspheres
The $E_8$ lattice is built from the $D_8$ lattice, which includes the integer points $(x_1, \dots, x_8)$ such that $\sum_i x_i$ is even. The vertices of $E_8$ consist of two subsets of points, corresponding to the $D_8$ lattice and another $D_8$ lattice translated to best fill the holes. More precisely, the translated $D_8$ consists of half-integer points $D_8 + (\frac{1}{2}, \dots, \frac{1}{2})$. We are interested in the intersection of $E_8$ with a hypersphere of radius $r$, where $r^2$ is an even integer (no point of $E_8$ has an odd squared norm).
We consider “atoms”, 8-tuples of non-negative integers or half-integers whose sum is an even number and whose squares sum to $r^2$. For small values of $r$ (we typically use up to ), atoms can be enumerated exhaustively. For example, all the atoms for $r^2$ = 10 are permutations of:
(6) 
The number of atoms as a function of is shown in Figure 5.
All lattice points on the hypersphere are permutations of an atom, with added signs. There are $8!$ permutations, but the permutation of equal components is irrelevant, which divides the number of combinations. For example atom corresponds to nonnegative vectors of . The signs can be set as follows:

for integer atoms, the components except 0 can have any sign. This generates $2^{8 - n_0}$ variants, where $n_0$ is the number of 0s in the atom;

for half-integer atoms, there is no 0, but the number of negative (and positive) values has to be even. Therefore, the number of possible signed variants is $2^7$.
To find the lattice point nearest to a given input, we proceed as follows:

we normalize the input by taking the absolute value of its components and sorting them;

we loop over the atoms to find the one that maximizes the dot product with the normalized input;

we revert the permutation on the atom;

we assign the signs of the input components to the corresponding atom components.
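The steps above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments. It works in doubled coordinates (each coordinate multiplied by 2) so that half-integer atoms stay integral, and the function names are hypothetical. One step is an addition not spelled out in the text: for half-integer atoms, when copying the input signs would break the even-sum rule, the sign of the smallest component is flipped, which is the cheapest correction in terms of dot product.

```python
import itertools
import math


def atoms(r2):
    """Non-increasing 8-tuples (doubled coordinates) for squared norm r2."""
    target = 4 * r2
    bound = math.isqrt(target)
    found = []
    for parity in (0, 1):                # all even: integer; all odd: half-integer
        values = [c for c in range(bound, -1, -1) if c % 2 == parity]
        for atom in itertools.combinations_with_replacement(values, 8):
            if sum(c * c for c in atom) != target:
                continue
            # Integer atoms need an even coordinate sum; half-integer atoms
            # can always be made valid by the choice of signs.
            if parity == 1 or sum(atom) % 4 == 0:
                found.append(atom)
    return found


def nearest_lattice_point(y, atom_list):
    """Nearest shell point to y (doubled coordinates), following the four steps."""
    order = sorted(range(8), key=lambda i: -abs(y[i]))    # step 1: sort |y|
    a = [abs(y[i]) for i in order]
    negatives = sum(1 for c in y if c < 0)
    best_score, best_atom, best_flip = None, None, False
    for atom in atom_list:                                # step 2: best atom
        score = sum(c * v for c, v in zip(atom, a))
        flip = atom[0] % 2 == 1 and (sum(atom) // 2 - negatives) % 2 == 1
        if flip:
            # Assumed parity fix: negate the smallest component.
            score -= 2 * atom[7] * a[7]
        if best_score is None or score > best_score:
            best_score, best_atom, best_flip = score, atom, flip
    point = [0] * 8
    for rank, i in enumerate(order):                      # steps 3 and 4
        sign = -1 if y[i] < 0 else 1
        if best_flip and rank == 7:
            sign = -sign
        point[i] = sign * best_atom[rank]
    return tuple(point)


y = [0.9, -0.1, 0.05, 0.0, 0.0, 0.0, 0.0, 0.8]
print(nearest_lattice_point(y, atoms(2)))
# (2, 0, 0, 0, 0, 0, 0, 2) in doubled coordinates, i.e. the point (1, 0, ..., 0, 1)
```

Since all candidate points share the same norm, maximizing the dot product is equivalent to minimizing the Euclidean distance, and the cost of the search grows with the number of atoms rather than with the number of lattice points.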
Footnotes
 University Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.
 We could consider lattices directly defined in higher dimensional space, see the lattice bestiary by Conway and Sloane conway2013sphere (). Yet the choice of a product of $E_8$ lattices is competitive in all aspects, notably quantization performance and efficiency, while other powerful lattices such as the Leech lattice are computationally expensive.
References
 Tim Kraska, Alex Beutel, Ed H. Chi, Jeff Dean, and Neoklis Polyzotis. The case for learned index structures. arXiv preprint arXiv:1712.01208, 2017.
 Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
 Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12), 2013.
 Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927, 2014.
 Matthijs Douze, Hervé Jégou, and Florent Perronnin. Polysemous codes. In European Conference on Computer Vision. Springer, 2016.
 Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, Jie Zhou, et al. Deep hashing for compact binary codes learning. In Conference on Computer Vision and Pattern Recognition, volume 1, 2015.
 Himalaya Jain, Joaquin Zepeda, Patrick Perez, and Remi Gribonval. SUBIC: A supervised, structured binary code for image search. In International Conference on Computer Vision, 2017.
 Benjamin Klein and Lior Wolf. In defense of product quantization. arXiv preprint arXiv:1711.08589, 2017.
 Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In ACM symposium on Theory of computing, 1998.
 Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Symposium on the Foundations of Computer Science, 2006.
 Nicolas Usunier, David Buffoni, and Patrick Gallinari. Ranking with ordered weighted pairwise classification. In International Conference on Machine Learning, 2009.
 Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11(Mar), 2010.
 Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Conference on Computer Vision and Pattern Recognition, 2014.
 Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
 Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
 Moses Charikar. Similarity estimation techniques from rounding algorithms. In ACM symposium on Theory of computing, 2002.
 Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
 Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Laurens Van Der Maaten, Eric Postma, and Jaap Van den Herik. Dimensionality reduction: a comparative review. Journal of Machine Learning Research, 10, 2009.
 Geoffrey Hinton and Sam Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, 2003.
 Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
 Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. Learning to hash for indexing big data: a survey. Proceedings of the IEEE, 104(1), 2016.
 Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimension via hashing. In International Conference on Very Large Data Bases, pages 518–529, 1999.
 Artem Babenko and Victor Lempitsky. Additive quantization for extreme vector compression. In Conference on Computer Vision and Pattern Recognition, 2014.
 Ting Zhang, Guo-Jun Qi, Jinhui Tang, and Jingdong Wang. Sparse composite quantization. In Conference on Computer Vision and Pattern Recognition, June 2015.
 Himalaya Jain, Patrick Pérez, Rémi Gribonval, Joaquin Zepeda, and Hervé Jégou. Approximate search with quantized sparse representations. In European Conference on Computer Vision, October 2016.
 Alexandre Sablayrolles, Matthijs Douze, Nicolas Usunier, and Hervé Jégou. How should we evaluate supervised hashing? In International Conference on Acoustics, Speech, and Signal Processing, 2017.
 John Horton Conway and Neil James Alexander Sloane. Sphere packings, lattices and groups, volume 290. Springer Science & Business Media, 2013.
 Moshe Ran and Jakov Snyders. Efficient decoding of the Gosset, Coxeter-Todd and the Barnes-Wall lattices. In International Symposium on Information Theory, page 92, 1998.
 Hervé Jégou, Laurent Amsaleg, Cordelia Schmid, and Patrick Gros. Query adaptative locality sensitive hashing. In International Conference on Acoustics, Speech, and Signal Processing, 2008.
 Loïc Paulevé, Hervé Jégou, and Laurent Amsaleg. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern recognition letters, 31(11), 2010.
 Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In International Conference on Machine Learning, 2017.
 F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
 Jan Beirlant, E. J. Dudewicz, L. Györfi, and E. C. van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6, 1997.
 Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In ACM symposium on Theory of computing, 2004.
 Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization for approximate nearest neighbor search. In Conference on Computer Vision and Pattern Recognition, 2013.
 Qin Lv, Moses Charikar, and Kai Li. Image similarity search with compact data structures. In International Conference on Information and Knowledge, pages 208–217, November 2004.
 Antonio Torralba, Rob Fergus, and Yair Weiss. Small codes and large image databases for recognition. In Conference on Computer Vision and Pattern Recognition, 2008.
 Artem Babenko and Victor Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In Conference on Computer Vision and Pattern Recognition, 2016.
 Hervé Jégou, Romain Tavenard, Matthijs Douze, and Laurent Amsaleg. Searching in one billion vectors: rerank with source coding. In International Conference on Acoustics, Speech, and Signal Processing, 2011.
 David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 2004.
 Marius Muja and David G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 2014.
 Yu A Malkov and DA Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320, 2016.
 Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
 Raghavendran Balu, Teddy Furon, and Hervé Jégou. Beyond “project and sign” for cosine estimation with binary codes. In International Conference on Acoustics, Speech, and Signal Processing, 2014.
 Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.