Learning Hash Codes via Hamming Distance Targets
Abstract
We present a powerful new loss function and training scheme for learning binary hash codes with any differentiable model and similarity function. Our loss function improves over prior methods by using log likelihood loss on top of an accurate approximation for the probability that two inputs fall within a Hamming distance target. Our novel training scheme obtains a good estimate of the true gradient by better sampling inputs and evaluating loss terms between all pairs of inputs in each minibatch. To fully leverage the resulting hashes, we use multi-indexing. We demonstrate that these techniques provide large improvements on similarity search tasks. We report the best results to date on competitive information retrieval tasks for ImageNet and SIFT 1M, improving MAP from 73% to 84% and reducing query cost by a factor of 28, respectively.
1 Introduction
Many information retrieval tasks rely on searching high-dimensional datasets for results similar to a query. Recent research has flourished on these topics due to enormous growth in data volume and industry applications [19]. These problems are typically solved in either two steps by computing an embedding and then doing lookup in the embedding space, or in one step by learning a hash function. We call these three problems the data-to-embedding problem, the embedding-to-results problem, and the data-to-results problem. There exists an array of solutions for each one.
Models that solve data-to-embedding problems aim to embed the input data in a space where proximity corresponds to similarity. The most commonly chosen embedding space is $\mathbb{R}^n$, in order to leverage lookup methods that assume Euclidean distance. Recent methods employ neural network architectures for embeddings in specific domains, such as facial recognition and sentiment analysis [18, 16].
Once the data-to-embedding problem is solved, numerous embedding-to-results strategies exist for similarity search in a metric space. For this step, the main challenge is achieving high recall with low query cost. Exact $k$-nearest neighbors (KNN) algorithms achieve 100% recall, finding the $k$ closest items to the query in the dataset, but they can be prohibitively slow. Brute force algorithms that compare distance to every other element of the dataset are often the most viable KNN methods, even with large datasets. Recent research has enabled exact KNN on surprisingly large datasets with low latency [11]. However, the compute resources required are still large. Alternatives exist that can reduce query costs in some cases, but increase insertion time. For instance, $k$-d trees require $O(\log N)$ search time on average with a high constant, but also require $O(\log N)$ insertion time on average, where $N$ is the dataset size.
Approximate nearest neighbors algorithms solve the embedding-to-results problem by finding results that are likely, but not guaranteed, to be among the $k$ closest. Similarly, approximate near-neighbor algorithms aim to find most of the results that fall within a specific distance of the query's embedding. These tasks (ANN) are generally achieved by hashing the query embedding, then looking up and comparing results under hashes close to that hash. Approximate methods can be highly advantageous by providing orders of magnitude faster queries with constant insertion time. Locality-sensitive hashing (LSH) is one such method that works by generating multiple, randomly chosen hash functions for each input. Each element of the dataset is inserted into multiple hash tables, one for each hash function. Queries can then be made by checking all hash tables for similar results. Another approach is quantization, which solves ANN problems by partitioning the space of inputs into buckets. Each element of the dataset is inserted into its bucket, and queries are made by selecting from multiple buckets close to the query.
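The multi-table LSH scheme described above can be sketched as follows. This is a minimal illustration, assuming random-hyperplane (sign) hashes as the hash family; that choice, along with the names `make_hyperplane_hash` and `LSHIndex`, is ours for illustration and is not necessarily the construction used in [4].

```python
import random


def make_hyperplane_hash(dim, bits, rng):
    """Return a hash function built from `bits` random hyperplanes (assumed family)."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

    def h(v):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(sum(a * b for a, b in zip(p, v)) > 0 for p in planes)

    return h


class LSHIndex:
    """Hypothetical multi-table LSH index: one hash table per random hash function."""

    def __init__(self, dim, bits, num_tables, seed=0):
        rng = random.Random(seed)
        self.hashes = [make_hyperplane_hash(dim, bits, rng) for _ in range(num_tables)]
        self.tables = [{} for _ in range(num_tables)]

    def insert(self, item, v):
        # Insert the item into every table under that table's hash of v.
        for h, table in zip(self.hashes, self.tables):
            table.setdefault(h(v), []).append(item)

    def query(self, v):
        # Union of the matching bucket from every table.
        found = set()
        for h, table in zip(self.hashes, self.tables):
            found.update(table.get(h(v), ()))
        return found
```

Candidates returned by `query` would still need to be ranked by true distance; more tables raise recall at the price of more insertions and more candidates to compare.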
Data-to-results methods determine similarity between inputs and provide an efficient lookup mechanism in one step. These methods directly compute a hash for each input, showing promise of simplicity and efficiency. Additionally, machine learning methods in this category train end-to-end, by which they can reduce inefficiencies in the embedding step. There has been a great deal of recent research into these methods in topics such as content-based image retrieval (CBIR). In other topics such as automated scene matching, hand-chosen hash functions are common [1]. But despite recent focus, data-to-results methods have had mixed results in comparison to data-to-embedding methods paired with embedding-to-results lookup [20, 12].
We assert the main reason data-to-results methods have sometimes underperformed is that training methods have not adequately expressed the model's loss. Our proposed approach trains neural networks to produce binary hash codes for fast retrieval of results within a Hamming distance target. These hash codes can be efficiently queried within the same Hamming distance by multi-indexing [17].
1.1 Related Work
Additional context in quantization and learning to hash is important to our work. Quantization is considered state-of-the-art in ANN tasks [20]. There are many quantization approaches, but two are particularly noteworthy: iterative quantization (ITQ) [5] and product quantization (PQ) [9]. Iterative quantization learns to produce binary hashes by first reducing dimensionality and then minimizing a quantization loss term, a measure of the amount of information lost by quantizing. ITQ uses principal component analysis for dimensionality reduction and $\|\operatorname{sgn}(v) - v\|_2^2$ for a quantization loss term, where $v$ is the pre-binarized output and $\operatorname{sgn}(v)$ is the quantized hash. It then minimizes quantization loss by alternately updating an offset and then a rotation matrix for the embedding. PQ is a generally more powerful quantization method that splits the embedding space into $M$ subspaces. A $k$-means algorithm is run on the embedding constrained to each subspace, giving $k$ Voronoi cells in each subspace for a total of $k^M$ hash buckets.
Recent methods that learn to hash end-to-end draw from a few families of loss terms to train binary codes [20]. These include terms for supervised softmax cross entropy between codes [8], supervised Euclidean distance between codes [13], and quantization loss terms [22]. Softmax cross entropy and Euclidean distance losses assume that Hamming distance corresponds to Euclidean distance in the pre-binarized outputs. Some papers try to enforce that assumption in a few different ways. For instance, quantization loss terms aim to make that assumption more true by penalizing networks for producing outputs far from $\pm 1$. Alternative methods to force outputs close to $\pm 1$ exist, such as HashNet, which gradually sharpens sigmoid functions on the pre-binarized outputs. Another family of methods first learns a target hash code for each class, then minimizes distance between each embedding and its target hash code [21, 15].
We observed four main shortcomings of existing methods that learn to hash end-to-end. First, cross entropy and Euclidean distance between pre-binarized outputs do not correspond to Hamming distance under almost any circumstances. Second, quantization loss and learning by continuation cause gradients to shrink during training, dissuading the model from changing the sign of any output. Third, methods using target hash codes are limited to classification tasks, and have no obvious extension to applications with non-transitive similarity. Finally, various multi-step training methods, including target hash codes, forfeit the benefit of training end-to-end.
1.2 Multi-indexing
Multi-indexing enables search within a Hamming radius $r$ by splitting an $n$-bit binary hash into $m$ substrings of length $n/m$ [17]. Technically, it is possible to use any $m$, but in most practical scenarios the best choice is $m = r + 1$. We consider only this case. (In scenarios with a combination of extremely large datasets, short hash codes, and large $r$, it is more efficient to use $m < r + 1$ substrings and make up for the missing Hamming radius with brute-force searches around each substring. However, since we are learning to hash, it makes more sense to simply choose a longer hash.) Each of these substrings is inserted into its own reverse index, pointing back to the content (Algorithm 1). Insertion runtime is therefore proportional to $m$, the number of multi-indices.
Lookup is performed by taking the union of all results for each substring, then filtering down to results within the Hamming radius (Algorithm 2). This enables lookup within a Hamming radius of $r$ by querying each substring in its corresponding index. Any result within Hamming distance $r$ will match on at least one of the $r + 1$ substrings by the pigeonhole principle, since at most $r$ differing bits can fall in at most $r$ of the substrings.
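The insertion and lookup procedures described above can be sketched as follows. This is a minimal illustration rather than a reproduction of Algorithms 1 and 2: it assumes codes are stored as Python integers, item identifiers are unique and hashable, and $m = r + 1$ substrings are used. The names `split_code` and `MultiIndex` are hypothetical.

```python
from collections import defaultdict


def split_code(code, n, m):
    """Split an n-bit integer code into m contiguous substrings."""
    bounds = [i * n // m for i in range(m + 1)]
    parts = []
    for i in range(m):
        width = bounds[i + 1] - bounds[i]
        parts.append((code >> bounds[i]) & ((1 << width) - 1))
    return parts


class MultiIndex:
    """Hypothetical multi-index over n-bit codes for Hamming radius r."""

    def __init__(self, n, r):
        self.n, self.r, self.m = n, r, r + 1
        # One reverse index per substring position.
        self.tables = [defaultdict(list) for _ in range(self.m)]

    def insert(self, code, item):
        # Insertion touches each of the m indices once.
        for table, part in zip(self.tables, split_code(code, self.n, self.m)):
            table[part].append((code, item))

    def lookup(self, query):
        # Union the exact substring matches, then filter by full Hamming distance.
        seen, results = set(), []
        for table, part in zip(self.tables, split_code(query, self.n, self.m)):
            for code, item in table.get(part, ()):
                if item not in seen and bin(code ^ query).count("1") <= self.r:
                    seen.add(item)
                    results.append(item)
        return results
```

Because a code within distance $r$ of the query must agree exactly on at least one of the $r + 1$ substrings, the exact-match probes miss no valid result; the final filter only removes false positives.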
With a well-distributed hash function, the average runtime of a lookup is proportional to the number of queries times the number of rows returned per query. Norouzi et al. treat the time to compare Hamming distance between codes as constant (a binary code can be treated as a long for $n \le 64$, giving constant time to XOR bits with another code on x64 architectures; summing the bits takes additional time, but this is small compared to the practical cost of retrieving a result), giving us a query cost of
$$\text{cost}(r) \propto (r + 1) \cdot \frac{N}{2^{n/(r+1)}},$$
where $N$ is the total number of $n$-bit hashes in the database. Like Norouzi et al., we recommend choosing $n$ and $r$ such that $n/(r+1) \approx \log_2 N$, providing a runtime proportional to
$$r + 1 \approx \frac{n}{\log_2 N}.$$
We build on this technique in Section 2.3.
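As a rough illustration of the query-cost reasoning in this section, the sketch below multiplies the number of substring probes by the expected rows per probe. It assumes uniformly distributed codes and $r + 1$ substrings of $n/(r+1)$ bits each; the function names are hypothetical, and learned hashes will generally deviate from the uniform assumption.

```python
import math


def expected_query_cost(N, n, r):
    """Expected rows fetched per lookup, assuming uniformly distributed n-bit codes."""
    m = r + 1                          # number of substring indices probed
    substring_bits = n / m             # bits per substring
    rows_per_probe = N / 2 ** substring_bits
    return m * rows_per_probe


def recommended_radius(N, n):
    """Largest r with n / (r + 1) >= log2(N), i.e. about one row per probe."""
    return max(0, math.floor(n / math.log2(N)) - 1)
```

For example, with $N = 2^{20}$ hashes of $n = 60$ bits, `recommended_radius` gives $r = 2$, and each of the three probes is expected to return about one row.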
2 Method
We propose a method of Hamming distance targets (HDT) that can be used to train any differentiable, black box model to hash. We will focus on its application to deep convolutional neural nets trained using stochastic gradient descent. Our loss function’s foundation is a statistical model relating pairs of embeddings to Hamming distances.
2.1 Loss Function
2.1.1 Motivation
Let $y = f(x)$ be the model's embedding for an input $x$, and let $D$ be the distribution of inputs to consider. We motivate our loss function with the following assumptions:

1. If $x \sim D$ is a random input, then $y_i \sim \mathcal{N}(0, 1)$. We partially enforce this assumption via batch normalization of $y$ with mean 0 and variance 1.
2. $y_i$ is independent of other $y_j$.

Let $z = y / \|y\|_2$ be the $L^2$-normalized output vector. Since $y$ is a vector of $n$ independent random normal variables, $z$ is a random variable distributed uniformly on the hypersphere.
This normalization is the same as SphereNorm [14] and similar to Riemannian Batch Normalization [3]. Liu et al. posed the question of why this technique works better in conjunction with batch norm than either approach alone, and our work bridges that gap. An $L^2$-normalized vector of IID random normal variables forms a uniform distribution on a hypersphere, whereas most other distributions would not. An uneven distribution would limit the regions on the hypersphere where learning can happen and leave room for internal covariate shift toward different, unknown regions of the hypersphere.
To avoid the assumption that Euclidean distance translates to Hamming distance, we further study the distribution of Hamming distance given these normalized vectors. We craft a good approximation for the probability that two bits match, given two uniformly random points on the hypersphere, conditioned on the angle between them.
For two such points $z_1$ and $z_2$ separated by an angle $\theta$, we know that $z_1 \cdot z_2 = \cos\theta$, so the arc length of the path on the unit hypersphere between them is $\theta = \arccos(z_1 \cdot z_2)$. A half loop around the unit hypersphere would cross each of the $n$ axis hyperplanes (i.e. $z_i = 0$) once, so a randomly positioned arc of length $\theta$ crosses $n\theta/\pi$ axis hyperplanes on average (Figure 1). Each axis hyperplane crossed corresponds to a bit flipped, so the probability that a random bit differs between these vectors is
$$p = \frac{\arccos(z_1 \cdot z_2)}{\pi}.$$
Given this exact probability, we estimate the distribution of Hamming distance between the binarized $z_1$ and $z_2$ by making the approximation that each bit position between the two vectors differs independently from the others with probability $p$. Therefore, the probability of Hamming distance being within $r$ is approximately $F(r; n, p)$, where $F$ is the binomial CDF. This approximation proves to be very close for large $n$ (Figure 2).
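The bit-flip probability and the binomial approximation above can be checked numerically. The sketch below is our own illustration, not the authors' experiment: it draws pairs of uniformly random unit vectors at a fixed angle, binarizes them by sign, and reports the empirical bit-difference rate and the fraction of pairs within a Hamming distance target. The function names and default parameters are arbitrary choices.

```python
import math
import random


def binom_cdf(r, n, p):
    """Binomial CDF: P[X <= r] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r + 1))


def random_unit(n, rng):
    """Uniform point on the unit hypersphere via a normalized Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def pair_at_angle(n, theta, rng):
    """Two uniformly oriented unit vectors separated by exactly angle theta."""
    z1 = random_unit(n, rng)
    u = random_unit(n, rng)
    dot = sum(a * b for a, b in zip(z1, u))
    u = [b - dot * a for a, b in zip(z1, u)]      # remove the component along z1
    norm = math.sqrt(sum(b * b for b in u))
    u = [b / norm for b in u]
    z2 = [math.cos(theta) * a + math.sin(theta) * b for a, b in zip(z1, u)]
    return z1, z2


def hamming(z1, z2):
    """Hamming distance between the sign-binarized vectors."""
    return sum((a > 0) != (b > 0) for a, b in zip(z1, z2))


def simulate(n=64, theta=0.3, r=4, trials=2000, seed=0):
    rng = random.Random(seed)
    dists = [hamming(*pair_at_angle(n, theta, rng)) for _ in range(trials)]
    bit_flip_rate = sum(dists) / (trials * n)
    within_r = sum(d <= r for d in dists) / trials
    return bit_flip_rate, within_r
```

Comparing `simulate(...)` against `theta / math.pi` and `binom_cdf(r, n, theta / math.pi)` gives a quick sanity check of both claims for a chosen $n$.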
Prior hashing research has made inroads with a similar observation, but applied it in the limited context of choosing vectors to project an embedding onto for binarization [10]. We apply this idea directly in network training.
2.1.2 Formulation
With batch size $b$, let $Y$ be our batch-normalized logit layer for a batch of inputs and $Z$ be the row-normalized version of $Y$; that is, $Z_i = Y_i / \|Y_i\|_2$. Let $P = \arccos(Z Z^T) / \pi$, taken elementwise. Let $w$ be the vector of all our model's learnable weights. Let $S$ be a similarity matrix such that $S_{ij} = 1$ if inputs $i$ and $j$ are similar and $S_{ij} = 0$ otherwise. Define $\circ$ to be the Hadamard product, or pointwise multiplication.
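Our loss function combines three terms:

1. $L_1 = -\frac{\sum_{i,j} \left(S \circ \log F(r; n, P)\right)_{ij}}{\sum_{i,j} S_{ij}}$, the negative average log likelihood of each similar pair of inputs to be within Hamming distance $r$.
2. $L_2 = -\frac{\sum_{i,j} \left((1 - S) \circ \log\left(1 - F(r; n, P)\right)\right)_{ij}}{\sum_{i,j} (1 - S_{ij})}$, the negative average log likelihood of each dissimilar pair of inputs to be outside Hamming distance $r$.
3. $L_3$, a regularization term on the model's learnable weights $w$ to minimize overfitting.

Here $F(r; n, \cdot)$ is the binomial CDF applied elementwise to the matrix $P$ of pairwise bit-difference probabilities, with $P_{ij} = \arccos(Z_i \cdot Z_j)/\pi$ for the normalized outputs of inputs $i$ and $j$.

Note that terms $L_1$ and $L_2$ work on all pairwise combinations of images in the batch, providing us with a very accurate estimate of the true gradient.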
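To make the two pairwise terms concrete, the following sketch evaluates them for a small batch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: it is not differentiable, it clamps probabilities at a small `eps` instead of extrapolating the log likelihood, it takes the similarity matrix (including its diagonal) exactly as the caller supplies it, and it leaves the regularization term and any weighting between terms to the caller. The name `pairwise_loss_terms` is hypothetical.

```python
import math


def binom_cdf(r, n, p):
    """Binomial CDF: P[X <= r] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r + 1))


def pairwise_loss_terms(Y, S, r, eps=1e-12):
    """Return (similar_term, dissimilar_term) as negative average log likelihoods.

    Y: list of b embedding rows, each of length n.
    S: b x b matrix of 0/1 similarity labels.
    r: Hamming distance target.
    """
    n = len(Y[0])
    Z = []
    for row in Y:
        norm = math.sqrt(sum(a * a for a in row))
        Z.append([a / norm for a in row])          # row-normalize each embedding

    sim_ll = dis_ll = 0.0
    n_sim = n_dis = 0
    for i in range(len(Z)):
        for j in range(len(Z)):
            cos = sum(a * b for a, b in zip(Z[i], Z[j]))
            cos = max(-1.0, min(1.0, cos))         # guard against rounding error
            p = math.acos(cos) / math.pi           # per-bit difference probability
            f = binom_cdf(r, n, p)                 # P[Hamming distance <= r]
            if S[i][j]:
                sim_ll += math.log(max(f, eps))
                n_sim += 1
            else:
                dis_ll += math.log(max(1.0 - f, eps))
                n_dis += 1
    return -sim_ll / n_sim, -dis_ll / n_dis
```

A training implementation would express the same computation with differentiable tensor operations so that gradients flow back through the embedding.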
While most machine learning frameworks do not currently have a binomial CDF operation, many (e.g., TensorFlow and Torch) support a differentiable operation for a beta distribution's CDF. This can be used instead via the well-known relation between the binomial CDF $F$ and the beta CDF $I$:
$$F(r; n, p) = I_{1-p}(n - r, r + 1).$$
For values of that are too low, this quantity underflows floating point numbers. This issue can be addressed by a linear extrapolation of log likelihood for . An exact formula exists, but a simpler approximation suffices, using the fact that for small :
2.2 Training Scheme
We construct training batches in a way that ensures every input has another input in the batch it is similar to. Specifically, each batch is composed of groups of inputs, where each group has one randomly selected marker input and random inputs similar to the marker. We then choose random groups to form. During training, similarity between inputs is determined dynamically, such that if two inputs from different groups happen to be similar, they are treated as such.
This method ensures that each loss term is well-defined, since there will be both similar and dissimilar inputs in each batch. Additionally, it provides a better estimate of the true gradient by balancing the huge class of dissimilar inputs with the small class of similar inputs.
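The batch construction described in this section can be sketched as follows. The sketch assumes similarity is available as a lookup from each input id to the ids similar to it; the group count, group size, sampling with replacement inside a group, and the names `sample_batch` and `similarity_matrix` are illustrative assumptions rather than the authors' settings.

```python
import random


def sample_batch(similar_to, num_groups, group_size, rng=random):
    """Build a batch of groups: one random marker plus random inputs similar to it.

    similar_to: dict mapping each input id to a list of ids similar to it.
    """
    candidates = [x for x, sims in similar_to.items() if sims]
    batch = []
    for marker in rng.sample(candidates, num_groups):
        batch.append(marker)
        # Sampling with replacement here is an assumption of this sketch.
        batch.extend(rng.choices(similar_to[marker], k=group_size - 1))
    return batch


def similarity_matrix(batch, similar_to):
    """Recompute similarity for every pair in the batch, across groups as well."""
    return [
        [int(a == b or b in similar_to.get(a, ())) for b in batch]
        for a in batch
    ]
```

Computing the similarity matrix over the whole batch, rather than per group, is what lets inputs from different groups that happen to be similar be treated as such.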
2.3 Multi-indexing with Embeddings
For additional recall on ANN tasks, we store our model's embedding in each row of the multi-index. We use this to rank results better, returning the closest $k$ of them to the query embedding. This adds to query cost, since evaluating the Euclidean distance between the query's embedding and each result's embedding scales with the hash size $n$, and selecting the top $k$ elements adds further cost per result. The heightened query cost allows us to compare query cost against quantization methods, which do the same ranking of final results by embedding distance. When using embeddings to better rank results in this way, we call our method HDTE.
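The re-ranking step can be sketched as below: candidates that survived the Hamming-radius filter are ordered by Euclidean distance between stored embeddings and the query embedding, and the closest $k$ are kept. The function name `rerank` and the `(item, embedding)` pair layout are assumptions of this sketch.

```python
import heapq
import math


def rerank(query_embedding, candidates, k):
    """Return the k items whose stored embeddings are closest to the query.

    candidates: iterable of (item, embedding) pairs already filtered by Hamming radius.
    """
    def distance(candidate):
        return math.dist(query_embedding, candidate[1])

    # A bounded heap avoids fully sorting all candidates.
    return [item for item, _ in heapq.nsmallest(k, candidates, key=distance)]
```

For instance, `rerank([0, 0], [("a", [3, 4]), ("b", [1, 0]), ("c", [0, 2])], 2)` keeps `"b"` and `"c"`, the two candidates nearest the origin.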
3 Results
3.1 ImageNet
We compared HDT against reported numbers for other machine learning approaches to similar image retrieval on ImageNet. We followed the same methodology as Cao et al., using the same training and test sets drawn from 100 ImageNet classes and starting from a pretrained Resnet V2 50 [6] ImageNet checkpoint accepting images. Fine tuning each model took 5 hours on a single Titan Xp GPU. Following convention, we computed mean average precision (MAP) for the first 1000 results by Hamming distance as our evaluation criterion. We also study our model’s precision and recall at different Hamming distances (Figure 3).
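For readers unfamiliar with the metric, one common way to compute average precision over the first $k$ ranked results is sketched below; MAP is then the mean of this value over all queries. Conventions for the normalizing denominator differ between papers, and this sketch assumes normalization by the number of relevant results found in the top $k$, which may not match every reported number exactly.

```python
def average_precision_at_k(relevance, k):
    """Average precision over the top-k results of one query.

    relevance: list of 0/1 flags, ordered by increasing Hamming distance to the query.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank       # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists, k):
    """Mean of average_precision_at_k over all queries."""
    return sum(average_precision_at_k(r, k) for r in relevance_lists) / len(relevance_lists)
```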
We highlight 5 comparator models: DBRv3 [15], HashNet [2], Deep hashing network for efficient similarity retrieval (DHN) [23], Iterative Quantization (ITQ) [5], and LSH [4]. DBRv3 learns by first choosing a target hash code for each class to maximize Hamming distance between other target hash codes, then minimizing distance between each image's embedding and target hash code. To the best of our knowledge, it had the highest reported MAP on the ImageNet image retrieval task prior to this work. HashNet trains a neural network to hash with a supervised cross entropy loss function by gradually sharpening a sigmoid function of its last layer until the outputs are all close to $\pm 1$. DHN similarly trains a neural network with supervised cross entropy loss, but with an added binarization loss term to coerce outputs close to $\pm 1$ instead of sharpening a sigmoid. Using and , our method achieved 81.2–83.8% MAP for hash bit lengths from 16 to 64 (Table 1), a 4.3–10.5% absolute improvement over the next best method.
Most interestingly, HDT performed better on shorter bit lengths. A shorter hash should never outperform a longer one, since it can be padded with constant bits to a longer hash. Our result may reflect a capacity for the model to overfit slightly with larger bit lengths, an increased difficulty to train a larger model, or a need to better tune parameters. In any case, the clear implication is that 16 bits are enough to encode 100 ImageNet classes.
3.2 SIFT 1M
We compared HDT against the state-of-the-art embedding-to-results method of Product Quantization on the SIFT 1M dataset, which consists of $10^6$ dataset vectors, $10^5$ training vectors, and $10^4$ query vectors in $\mathbb{R}^{128}$.
We trained HDT from scratch using a simple 3-layer DenseNet [7] with 256 ReLU-activated batch-normalized units per layer. During training, we defined input $x_i$ to be similar to $x_j$ if $x_i$ is among the 10 nearest neighbors to $x_j$. Training each model took 75 minutes on a single GeForce 1080 GPU. We compared the recall-query cost tradeoff at different values of $n$, $r$, and (Table 2). We used the standard recall metric for this dataset of recall@100, where recall@$R$ is the proportion of queries whose single nearest neighbor is in the top $R$ results.
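The recall metric defined above can be computed as in the following sketch, which assumes each query comes with a ranked result list and the id of its single true nearest neighbor. The function name `recall_at` is hypothetical.

```python
def recall_at(results_per_query, true_nn_per_query, R):
    """Proportion of queries whose true nearest neighbor is in the top R results."""
    hits = sum(
        nn in results[:R]
        for results, nn in zip(results_per_query, true_nn_per_query)
    )
    return hits / len(true_nn_per_query)
```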
HDTE defied even our expectations by providing higher recall than reported numbers for PQ while requiring fewer distance comparisons (Figure 4). This implies that even on embedding-to-results tasks, HDTE can be implemented to provide better results than PQ with faster query speeds. The improvement is particularly great in the high-recall regime. Notably, HDTE gets 78.1% recall with an average of 12,709 distance comparisons, whereas PQ gets only 74.4% recall with 101,158 comparisons.
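Table 2: Recall@100 and average number of distance comparisons for HDTE on SIFT 1M at different hash lengths $n$ and Hamming distance targets $r$.

$n$   $r$   recall, comparisons   recall, comparisons   recall, comparisons
16    0     32.4%, 1463           20.6%, 366            12.0%, 80.6
32    1     59.4%, 4984           42.0%, 1324           26.5%, 247
64    2     90.1%, 42851          78.1%, 12709          64.5%, 4105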
4 Discussion
Our novel method of Hamming distance targets vastly improved recall and query speed in competitive benchmarks for both data-to-results tasks and embedding-to-results tasks. HDT is also general enough to use any differentiable model and similarity criterion, with applications in image, video, audio, and text retrieval.
We developed a sound statistical model as the foundation of HDT’s loss function. We also shed light on why normalization of layer outputs improves learning in conjunction with batch norm. For future study, we are interested in better understanding the theoretical distribution of Hamming distances between points on a sphere separated by a fixed angle.
References
 [1] Aasif Ansari and Muzammil Mohammed. Content based video retrieval systems  methods, techniques, trends and challenges. In International Journal of Computer Applications, volume 112(7), 2015.
 [2] Zhangjie Cao, Mingsheng Long, and Philip S. Yu. HashNet: Deep learning to hash by continuation. arXiv preprint arXiv:1702.00758 [cs.CV], 2017.
 [3] Minhyung Cho and Jaehyung Lee. Riemannian approach to batch normalization. In Advances in Neural Information Processing Systems 30 (NIPS 2017) preproceedings, 2017.
 [4] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. VLDB, pages 518–529, 1999.
 [5] Yunchao Gong and Svetlana Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
 [7] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, July 2017.
 [8] Himalaya Jain, Joaquin Zepeda, Patrick Perez, and Remi Gribonval. Subic: A supervised, structured binary code for image search. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [9] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. TPAMI, 33, 2011.
 [10] Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, and Qi Tian. Super-bit locality-sensitive hashing. In Conference on Neural Information Processing Systems, pages 108–116, 2012.
 [11] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 [cs.CV], 2017.
 [12] Benjamin Klein and Lior Wolf. In defense of product quantization. arXiv preprint arXiv:1711.08589 [cs.CV], 2017.
 [13] H. Liu, R. Wang, S. Shan, and X. Chen. Deep supervised hashing for fast image retrieval. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2064–2072, 2016.
 [14] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. In Advances in Neural Information Processing Systems 30 (NIPS 2017) preproceedings, 2017.
 [15] Xuchao Lu, Li Song, Rong Xie, Xiaokang Yang, and Wenjun Zhang. Deep binary representation for efficient image retrieval. Advances in Multimedia, 2017.
 [16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.
 [17] Mohammad Norouzi, Ali Punjani, and David J. Fleet. Fast search in Hamming space with multi-index hashing. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3108–3115, 2012.
 [18] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, June 2015.
 [19] J. Wang, W. Liu, S. Kumar, and S. F. Chang. Learning to hash for indexing big data: A survey. Proceedings of the IEEE, 104(1):34–57, 2016.
 [20] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):769–790, April 2018.
 [21] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In AAAI Conference on Artificial Intelligence, 2014.
 [22] Yuefu Zhou, Shanshan Huang, Ya Zhang, and Yanfeng Wang. Deep hashing with triplet quantization loss. arXiv preprint arXiv:1710.11445 [cs.CV], 2017.
 [23] Han Zhu, Mingsheng Long, Jianmin Wang, and Yue Cao. Deep hashing network for efficient similarity retrieval. AAAI, 2016.