A Revisit on Deep Hashings for Large-scale Content Based Image Retrieval
Abstract
There is a growing trend in studying deep hashing methods for content-based image retrieval (CBIR), where hash functions and binary codes are learnt using deep convolutional neural networks and the binary codes are then used for approximate nearest neighbor (ANN) search. All the existing deep hashing papers report superior performance over traditional hashing methods according to their experimental results. However, there are serious flaws in the evaluations of existing deep hashing papers: (1) The datasets they used are too small and simple to simulate a real CBIR situation. (2) They did not correctly include the search time in their evaluation criteria, although the search time is crucial in real CBIR systems. (3) The performance of some unsupervised hashing algorithms (e.g., LSH) can easily be boosted by using multiple hash tables, an important factor that should be considered in the evaluation but that most deep hashing papers failed to consider. We re-evaluate several state-of-the-art deep hashing methods with a carefully designed experimental setting. Empirical results reveal that the performance of these deep hashing methods is inferior to multi-table IsoH, a very simple unsupervised hashing method. Thus, the conclusions in all the deep hashing papers should be carefully re-examined.
1 Introduction
Content-based image retrieval (CBIR) [22] is an interesting and popular problem in computer vision and information retrieval. As the amount of image data grows explosively, approximate nearest neighbor (ANN) [8] search becomes a necessary component of CBIR systems, so that images can be retrieved efficiently. Hashing is one of the most popular ANN search approaches. The main idea of hashing-based ANN methods is to map images into a similarity-preserving Hamming space where the search space can be efficiently pruned.
Deep convolutional neural networks (DCNN) have been successfully applied to the CBIR task [25], because they can learn better feature representations and can easily be designed as end-to-end models. Moreover, a great number of approaches have been proposed to use DCNN for binary code learning [27, 15, 31, 11, 14, 30, 33, 2, 12, 26, 16, 4, 19, 3, 28, 13, 23, 32, 29]. All these papers claim that deep hashing methods outperform traditional hashing methods such as LSH [6]. However, there are serious flaws in the evaluations of these deep hashing papers:

In [27, 15, 31, 11, 14, 30, 33, 2, 12, 26, 16, 4, 19, 3, 28, 13, 23, 32], the datasets used are very small and have a very limited number of classes. Good performance on these simple datasets cannot guarantee good performance on real-life CBIR tasks. Some papers [29] use the Imagenet dataset [20], which is large. However, they use a fully supervised setting, which is also not appropriate.

All the ANN search methods (e.g., the hashing methods) sacrifice accuracy for efficiency. Thus, the search time must be reported whenever accuracy measures (mean average precision, precision at a given number of returned samples, precision within a given Hamming radius, and precision-recall curves) are reported. However, most of the existing deep hashing papers [27, 15, 31, 11, 14, 30, 33, 2, 12, 26, 16, 4, 19, 3, 28, 13, 23, 32, 29] failed to do so.

The performance of some unsupervised hashing algorithms (e.g., LSH [6]) can easily be boosted by using multiple hash tables, while it is not clear how to use the same trick (multiple tables) for the deep hashing algorithms. This is an important factor that should be considered in the evaluation, but most of the papers [27, 15, 31, 11, 14, 30, 33, 2, 12, 26, 16, 4, 19, 3, 28, 13, 23, 32, 29] failed to do so.
Therefore, the conclusion in all these deep hashing papers (that deep hashing methods are superior to traditional hashing methods) should be carefully re-examined.
In this paper, we carefully design the experiments: 1) we use the Imagenet dataset; 2) we use precision-time curves; 3) we use the multiple tables trick for traditional hashing methods. Three state-of-the-art deep hashing methods [14, 12, 29] are compared with LSH [6] and IsoH [9]. Experimental results reveal that the performance of the deep hashing methods is inferior to that of IsoH [9], which is a simple unsupervised hashing method.
The goal of this paper is not to prove that some traditional unsupervised hashing methods are better than the deep hashing methods. We only want to show that the claim made in most of the existing deep hashing papers [27, 15, 31, 11, 14, 30, 33, 2, 12, 26, 16, 4, 19, 3, 28, 13, 23, 32, 29], namely that deep hashing is superior to traditional hashing, should be carefully re-examined.
2 Search With The Hash Index
Most deep hashing papers fail to include the search time in their evaluation. One reason may be that it seems natural to assume that two hashing algorithms spend the same time retrieving the same number of images with the same code length. However, careful analysis of the search-with-the-hash-index procedure, together with experiments, shows that this assumption is wrong [1]. Since [1] provides a very detailed analysis of how to search with the hash index, we simply restate the main conclusions here.
A hashing algorithm generating a one-bit code actually partitions the original image feature space into two parts: the images in one part receive code 1 and the images in the other part receive code 0. When an $m$-bit code is used, the hashing algorithm partitions the image feature space into $2^m$ parts, which can be named hash buckets. Thus, all the images fall into different hash buckets (associated with different binary codes). Ideally, if neighbor images fall into the same bucket or nearby buckets (measured by the Hamming distance of binary codes), the hashing method can efficiently retrieve the nearest neighbors of a query image.
Suppose the user submits a query and asks for $K$ images from the database. The search-with-the-hash-index procedure is as follows (the parameter $L$, the number of images retrieved into the scanning pool, controls the trade-off between efficiency and accuracy):

Encode the query image into the binary code.

Locate the bucket indexed by the same binary code as the query code (i.e., the two codes have Hamming distance $0$) in the hash table, and retrieve at most $L$ images in that bucket into the scanning pool. If there are fewer than $L$ images in that bucket, the Hamming distance between the query code and the fetched buckets is increased. This process is repeated until $L$ images are retrieved.

The images in the scanning pool are sorted according to their distances from the query image and the $K$ nearest neighbors are returned.
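The locating and ranking steps above can be sketched in Python as follows. This is only a minimal illustration, not the C++ implementation used in the experiments: the function names (`build_table`, `codes_at_radius`, `locate`, `search`), the representation of binary codes as integers, and the use of squared Euclidean distance for ranking are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_table(codes):
    """Group item ids by binary code; each distinct code is one hash bucket."""
    table = defaultdict(list)
    for item_id, code in enumerate(codes):
        table[code].append(item_id)
    return table


def codes_at_radius(code, n_bits, radius):
    """Yield all codes at Hamming distance exactly `radius` from `code`."""
    for positions in combinations(range(n_bits), radius):
        flipped = code
        for p in positions:
            flipped ^= 1 << p  # flip bit p
        yield flipped


def locate(query_code, table, n_bits, pool_size):
    """Fill the scanning pool, enlarging the Hamming radius as needed."""
    pool = []
    for radius in range(n_bits + 1):
        for bucket_code in codes_at_radius(query_code, n_bits, radius):
            for item_id in table.get(bucket_code, ()):
                pool.append(item_id)
                if len(pool) >= pool_size:
                    return pool
    return pool  # the whole table holds fewer than pool_size items


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query_vec, query_code, table, vectors, n_bits, pool_size, k):
    """Locate candidates, then rank them by distance in feature space."""
    pool = locate(query_code, table, n_bits, pool_size)
    pool.sort(key=lambda i: squared_distance(vectors[i], query_vec))
    return pool[:k]


# Toy database: four 2-D feature vectors with hand-assigned 2-bit codes.
vectors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.1)]
codes = [0b00, 0b00, 0b11, 0b01]
table = build_table(codes)
neighbors = search((0.02, 0.0), 0b00, table, vectors, n_bits=2, pool_size=3, k=2)
```

In this sketch, every code produced by `codes_at_radius` is one bucket lookup, whether or not the bucket is empty, so the number of lookups performed inside `locate` is what the locating time measures.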
The time spent on each step [1] is as follows:

Coding time: The time used to convert the query image to the binary code.

Locating time: The time used to locate the buckets in the hash table and retrieve candidate images.

Scanning time: The time used to scan images in the scanning pool and return results.
The coding time and scanning time (proportional to the number of images in the scanning pool) can be easily analyzed. However, most of the existing deep hashing papers ignore the locating time, which becomes the dominant part when one aims at high accuracy [1].
Given a binary code $b$, locating the hash bucket corresponding to $b$ costs $O(1)$ time (by using std::vector or std::unordered_map). If we only need to examine a small number of hash buckets, the locating time can be neglected. This happens in the ideal case where all the neighbor images fall into the same bucket or nearby buckets (measured by the Hamming distance of binary codes). However, there is no guarantee that all the neighbor images will fall into nearby buckets. To ensure high precision, one needs to examine many hash buckets (i.e., enlarge the search radius in Hamming space).
Given an $m$-bit binary code $b$, consider the hash buckets whose Hamming distance to $b$ is smaller than or equal to $r$. It is easy to show that the number of these buckets is $\sum_{i=0}^{r} \binom{m}{i}$, which increases almost exponentially with respect to $r$. Thus, the locating time cannot be ignored if we aim at achieving high precision (i.e., if we need to examine many hash buckets).
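To make the growth concrete, the bucket count can be evaluated directly. The short Python sketch below assumes the standard count of binary codes within a Hamming ball (a sum of binomial coefficients); the function name `buckets_within_radius` and the 32-bit example are chosen only for illustration.

```python
from math import comb


def buckets_within_radius(n_bits, radius):
    """Number of n_bits-bit codes within Hamming distance `radius` of a code."""
    return sum(comb(n_bits, i) for i in range(radius + 1))


# Bucket counts for a 32-bit code as the search radius grows.
counts = [buckets_within_radius(32, r) for r in range(4)]
# counts == [1, 33, 529, 5489]
```

For a 32-bit code, enlarging the radius from 0 to 3 already raises the number of buckets to visit from 1 to 5,489, and at radius 32 the count reaches all $2^{32}$ buckets.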
Even with the same code length and the same scanning-pool size, two different hashing algorithms can lead to significantly different locating times. This is mainly due to the different distributions of images over the Hamming space. Indeed, the locating time is highly related to the quality of the binary codes, and it should never be ignored when evaluating hashing methods for CBIR.
3 Deep Hashing VS. Traditional Hashing
Inspired by the revolutionary success of DCNN on computer vision tasks, researchers proposed to use DCNN for binary code learning, the so-called deep hashing methods [27, 15, 31, 11, 14, 30, 33, 2, 12, 26, 16, 4, 19, 3, 28, 13, 23, 32, 29]. The deep models for deep hashing learning are not complicated. Actually, these deep hashing models can be obtained from any conventional deep model (for classification tasks, e.g., AlexNet [10], VGG16 [21], GoogLeNet [24] and ResNet [7]) simply by replacing the output layer with one of various deep hashing modules (designed for various loss functions, for ensuring binary outputs, etc.). Figure 1 gives an example of converting a conventional deep model (we use AlexNet as an example) to a deep hashing model (we use SSDH [29] as an example). The gray box in Figure 1 (b) shows the deep hashing module used in SSDH [29].
For representation purposes, both AlexNet (Figure 1 (a)) and the Deep Hashing Net (Figure 1 (b)) contain a layer that provides a 4096-dimensional real-valued vector representation of the input image. With this real-valued representation, traditional hashing algorithms (e.g., LSH [6]) can then be applied for binary code learning. One of the main motivations of deep hashing methods is that the joint learning of representation and binary code could lead to better binary codes [27, 15, 31, 11, 14, 30, 33, 2, 12, 26, 16, 4, 19, 3, 28, 13, 23, 32, 29].
However, this motivation should be carefully examined. Since the binary codes are learned from the real-valued vector representation, the quality of the real-valued vectors is one of the keys determining the performance of the binary codes. Thus, it is unfair to compare deep hashing methods with traditional hashing algorithms that take handcrafted features as inputs [27, 33, 2, 23]. Moreover, with the deep hashing module, it is highly likely that the 4096-dimensional real features produced by AlexNet (Figure 1 (a)) will differ from the 4096-dimensional real features produced by the Deep Hashing Net (Figure 1 (b)), although the two deep networks share the same structure in the earlier part. Thus, it is also unfair to compare the deep hashing codes learned from the network in Figure 1 (b) with traditional hashing algorithms that take their inputs from the network in Figure 1 (a). A fair comparison should let the traditional hashing algorithms take their inputs from the 4096-dimensional features of the network in Figure 1 (b).
Another important difference between deep hashing and traditional hashing is that many traditional unsupervised hashing algorithms can use the so-called multiple hash tables trick [1] with almost no additional computational burden. Take LSH [6] as an example. Since LSH is essentially based on random projection, two hash tables generated by LSH will naturally be different (i.e., a query point will have different neighbor vectors in nearby hash buckets). To locate a required number of points with only one hash table, we have to increase the Hamming radius $r$ whenever the buckets within Hamming distance $r$ do not contain enough points. This increases the locating time considerably. With multiple hash tables, instead of increasing $r$, we can scan the buckets within Hamming radius $r$ in all the tables, which gives us a larger chance of locating enough data points. Plenty of experimental results in [1] show the superior performance obtained by using the multiple tables trick.
To use the multiple tables trick, a hashing algorithm must generate different hash tables for the same dataset. Some hashing algorithms (e.g., LSH [6] and IsoH [9]) are random in nature and can naturally use the multiple tables trick. Meanwhile, there is no need to change the inputs (4096-dimensional real vectors) for these traditional hashing algorithms, i.e., there is no need to train multiple deep models. During the training stage, since the training process for many traditional hashing algorithms (e.g., LSH [6] and IsoH [9]) is very efficient, there is almost no additional computational burden on training. More importantly, during the search stage, using LSH or IsoH to convert a real vector to a binary vector is extremely fast, so the coding time is almost the same for a single table and for multiple tables (see the next section for a detailed analysis).
For a deep hashing method, since the deep model usually converges to a local optimum, it is possible that two training runs generate different deep hashing codes. However, this means we need to train the deep model several times (depending on how many hash tables we want to use). This process introduces a significant computational burden in the training stage for large-scale data. More importantly, feeding the query image forward through multiple deep networks increases the coding time significantly. This makes the multiple tables trick impractical for deep hashing methods.
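The multiple tables trick for a random-projection hash can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code used in the experiments: it uses a sign-random-projection variant of LSH, only scans the radius-0 bucket of each table, and the names `random_projection`, `encode`, `build_tables` and `candidates_radius0` are hypothetical.

```python
import random
from collections import defaultdict


def random_projection(n_bits, dim, rng):
    """Draw an n_bits x dim Gaussian projection matrix."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def encode(vec, projection):
    """One bit per projection row: 1 if the projected value is positive."""
    return tuple(int(sum(w * x for w, x in zip(row, vec)) > 0) for row in projection)


def build_tables(vectors, n_tables, n_bits, seed=0):
    """Build independent hash tables, each with its own random projection."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    tables = []
    for _ in range(n_tables):
        projection = random_projection(n_bits, dim, rng)
        buckets = defaultdict(list)
        for item_id, vec in enumerate(vectors):
            buckets[encode(vec, projection)].append(item_id)
        tables.append((projection, buckets))
    return tables


def candidates_radius0(query, tables):
    """Union of the query's own bucket (Hamming radius 0) over all tables."""
    found = set()
    for projection, buckets in tables:
        found.update(buckets.get(encode(query, projection), ()))
    return found


# Toy data: 200 random 8-dimensional feature vectors, 12-bit codes.
data_rng = random.Random(1)
vectors = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
single = candidates_radius0(vectors[0], build_tables(vectors, 1, 12))
multi = candidates_radius0(vectors[0], build_tables(vectors, 16, 12))
```

With the same seed, the first of the 16 tables coincides with the single table, so `multi` always contains `single`; any extra candidates come from the other tables without enlarging the Hamming radius. The only per-query cost added by each extra table is one more small projection of the already extracted feature vector.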
4 Experiments
In the remaining part of the paper, we perform extensive experiments to support our findings. We begin with a description of the datasets used in the experiments.
4.1 Datasets
Three datasets are used in this paper. Two of them are small and one is large.

CIFAR10, which contains 60,000 images belonging to 10 categories.

MNIST, which contains 70,000 images belonging to 10 categories.

Imagenet, which contains more than 1.2M images belonging to 1,000 categories.
4.2 Compared Methods
Three deep hashing methods are compared in the experiments:

DLBH, from the paper Deep Learning of Binary Hash Codes for Fast Image Retrieval [14].

DPSH, from the paper Feature Learning based Deep Supervised Hashing with Pairwise Labels [12].

SSDH, from the paper Supervised Learning of Semantics-preserving Hash via Deep Convolutional Neural Networks [29].
The main reason for picking these three methods is that all of them provide publicly available code (DLBH: https://github.com/kevinlin311tw/caffecvprw15, DPSH: http://cs.nju.edu.cn/lwj/code/DPSH.zip, SSDH: https://github.com/kevinlin311tw/CaffeDeepBinaryCode).
As discussed in Section 3, a deep hashing model can be obtained from any conventional deep model for the classification task. To make fair comparisons, all three deep hashing models are adapted from the ResNet [7] architecture. Our implementation is based on the torch ResNet implementation (https://github.com/facebook/fb.resnet.torch/tree/master). The deep hashing modules for the three deep hashing methods strictly follow the original implementations in the publicly available code.
We use the ResNet34 network for the two small datasets and the ResNet50 network for the Imagenet dataset. All the images are resized to 224×224 in order to fit the ResNet input size.
Two traditional unsupervised hashing methods are compared in the experiments:

LSH is a short name for Locality Sensitive Hashing [6]. LSH is based on random projection and is frequently used as a baseline method in various hashing papers.

IsoH is a short name for Isotropic Hashing [9]. IsoH learns the projection functions with isotropic variances for PCA-projected data. The main motivation of IsoH is that PCA directions with different variances should not be treated equally (one bit for one direction). The performance of IsoH is quite good according to the comparative study in [1].
Both algorithms can be downloaded from github (https://github.com/dengcai78/MatlabFunc/tree/master/ANNS/Hashing).
One reason for choosing these two hashing algorithms is their efficiency in the test stage. Both algorithms only need a matrix multiplication to convert the input vector to the required binary code. Thus, LSH and IsoH can also be modeled using the deep network in Figure 1 (a). After the network is learned, one can simply replace the weights connecting the 4096-dimensional feature layer and the output layer by the transformation matrix learned in LSH (or IsoH).
We report the performance of LSH and IsoH using a single table and 16 tables. For deep hashing methods, we only report the performance using a single table, because it is too time-consuming to learn multiple deep hashing tables. Take the Imagenet dataset as an example. Training a deep hashing model modified from the ResNet50 network with only 10% of the training data on a machine with four GTX1080 GPUs costs more than 20 hours. In contrast, training an LSH table from the learned deep features only needs 2.09s and training an IsoH table only needs 35.16s on an i7-5930K CPU. Moreover, using multiple tables for deep hashing methods requires keeping multiple deep models. The coding time with multiple deep hashing tables will be much longer (every query image has to be fed forward through multiple deep models); see the next subsection for a detailed analysis.
To make fair comparisons, we use the features extracted from the fully-connected layer immediately before the layer that generates the binary codes of DLBH as the inputs to LSH and IsoH. The dimensionality of the feature is 512 for the two small datasets and 2048 for the Imagenet dataset.
4.3 Evaluation Criteria
Given a query image, the algorithm is expected to return $K$ images. We then examine how many images in this returned set are relevant to the query image (i.e., share the same label). Suppose the set of relevant images among those returned for a query is $R$; the precision can then be defined [17] as
$$\mathrm{precision} = \frac{|R|}{K} \tag{1}$$
We report Precision@10 and Precision@100 (i.e., $K = 10$ and $K = 100$) throughout our experiments.
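The precision measure described above (the fraction of returned images that share the query's label) can be computed as in the following sketch. The function names and the list-of-labels input format are assumptions made for the example.

```python
def precision_at_k(returned_labels, query_label, k):
    """Fraction of the first k returned images that share the query's label."""
    top_k = returned_labels[:k]
    relevant = sum(1 for label in top_k if label == query_label)
    return relevant / k


def mean_precision_at_k(results, k):
    """Average precision@k over (returned_labels, query_label) pairs."""
    return sum(precision_at_k(r, q, k) for r, q in results) / len(results)


# One query labeled "cat" whose four returned images have these labels.
p = precision_at_k(["cat", "dog", "cat", "cat"], "cat", 4)
# p == 0.75
```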
As discussed in Section 2, the search time should be reported together with the precision. Thus, we use the precision-time curve for evaluation. Ideally, the search time should include all three parts: coding time, locating time and scanning time.
The coding time is the time used to convert the query image to the binary code. It can be divided into two parts. The first part is converting the image to the real-valued feature vector (from the input to the feature layer in Figure 1 (b)), which is the same for all the hashing methods (both deep hashing methods and traditional hashing methods). The second part is converting the real-valued feature vector to the binary code (deep hashing methods use the deep hashing module in Figure 1 (b), while traditional hashing methods use a matrix multiplication). Since the first part dominates the coding time, the coding time for deep hashing methods and for LSH (or IsoH) is almost the same.
Take Imagenet as an example. Based on the study on github (https://github.com/jcjohnson/cnnbenchmarks), if the ResNet50 network is used, the first part of the coding time is around 50ms per image on a GTX1080. If we have 25,000 query images, the total time for the first part is around 125s. In comparison, the total processing time (the second part of the coding time) of LSH (or IsoH) for 25,000 query images is around 0.037s on an i7-5930K CPU. Even if we need 16 hash tables for LSH (or IsoH), the time spent on the second part can still be neglected compared to the time spent on the first part.
Thus, we can safely conclude that the coding time for deep hashing methods and for LSH (or IsoH) is almost the same, even if LSH (or IsoH) uses 16 hash tables. Moreover, our primary goal is to evaluate the quality of the binary codes generated by different hashing methods, which is mainly reflected by the locating and scanning time.
Thus, the time in the precision-time curves reported in our experiments only includes the locating time and the scanning time. After obtaining the codes generated by the different hashing methods, we use the open source C++ search-with-the-hash-index code (https://github.com/fc731097343/efanna/tree/master/samples_hashing) on the same i7-5930K CPU for fair evaluation (by tuning the number of images retrieved into the scanning pool, we obtain the curves for the different hashing methods).
Table 1: Sizes of the Train, Val, Base and Query sets for CIFAR10 and MNIST.

Dataset   #Train  #Val   #Base   #Query
CIFAR10   5000    5000   50000   5000
MNIST     6000    5000   60000   5000
The images in Val and Query sets are from the test split provided by the original dataset. The images in Base set are from the train split provided by the original dataset. The images in Train set are 10% randomly selected from the Base set. Thus, the images in Val, Query and Base sets are different from each other.
4.4 Experiments on Small Datasets
In this section, we report the experimental results on two small datasets, i.e., CIFAR10 and MNIST.
Based on the datasets’ original train/test splits, we build our base/train/validation/query splits. We use the train/validation sets to train the deep hashing networks, and use the base/query sets to simulate the image search. On both datasets, we use the original train set as our base set, and randomly select 10% of the images from each category in the base set to form our train set. We then split each category of the original test set evenly and randomly into our validation and query sets, so that our models cannot see the query images when learning the binary codes. The size of each set is given in Table 1.
Figures 2 and 3 show the performance of various hashing methods on CIFAR10 and MNIST, respectively. We use 24-bit codes for all the hashing methods (please see the supplementary file for a detailed discussion on the selection of the code length).
We can clearly see the performance boost of using 16 tables over a single table for LSH and IsoH in all cases. However, even when 16 tables are used, the best deep hashing method still achieves significantly better results on both datasets. This result is consistent with the previous deep hashing papers.
However, CIFAR10 and MNIST are too small and have only 10 categories. The results on these two small datasets are not enough to convince people that the best method so far will also perform best in a real, large-scale and complicated CBIR system.
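The split construction described above can be sketched as follows. This is a hedged illustration rather than the script actually used: the function name `make_splits`, the fixed random seed, and the representation of each split as a list of indices are assumptions. Indices in `base` and `train` refer to the original train split, while indices in `val` and `query` refer to the original test split.

```python
import random
from collections import defaultdict


def group_by_label(labels):
    """Map each label to the list of indices carrying that label."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return groups


def make_splits(train_labels, test_labels, train_fraction=0.1, seed=0):
    rng = random.Random(seed)

    # Base set: the whole original train split.
    base = list(range(len(train_labels)))

    # Train set: the same fraction sampled at random from every category.
    train = []
    for label, ids in sorted(group_by_label(train_labels).items()):
        train.extend(rng.sample(ids, round(len(ids) * train_fraction)))

    # Val/query sets: each test category shuffled and split in half.
    val, query = [], []
    for label, ids in sorted(group_by_label(test_labels).items()):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        val.extend(shuffled[:half])
        query.extend(shuffled[half:])

    return {"base": base, "train": train, "val": val, "query": query}


# Toy labels: 10 categories, 100 train and 20 test images per category.
toy_train = [c for c in range(10) for _ in range(100)]
toy_test = [c for c in range(10) for _ in range(20)]
splits = make_splits(toy_train, toy_test)
```

Because the validation and query indices are disjoint halves of the test split, no query image is seen while the binary codes are learned.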
4.5 Experiments on Imagenet
We need to use a more complicated and larger dataset. The Imagenet dataset [20] is a good choice. It contains more than 1.2M images belonging to 1,000 categories.
4.5.1 Fully supervised setting
Actually, the Imagenet dataset has been used in the SSDH paper [29]. However, the labels of all the images in the base set are used there for learning the binary codes. In real-life CBIR scenarios, the image database is very large, and it is too expensive and time-consuming to manually label all the images in the base set. Moreover, with all the label information of the base set available, one can use a very simple but effective coding scheme which makes all other hashing algorithms meaningless.
Table 2: Sizes of the Train, Val, Base and Query sets for Imagenet in the fully supervised setting.

Dataset    #Train    #Val    #Base     #Query
Imagenet   1281167   25000   1281167   25000
The images in Val and Query sets are from the validation split provided by the original dataset. The images in Train and Base sets are the same and come from the train split provided by the original dataset. Thus, the images in Val, Query and Base sets are different from each other.
Classification random coding: With the label information of the images in the entire base set available, we can design a simple binary encoding scheme with the help of a well trained classifier. Suppose the code length is $m$; each class in the dataset is uniquely and randomly mapped to an integer ranging from $0$ to $2^m - 1$, and the integer is then converted to its binary representation of length $m$. Thus, each class maps to a unique binary code. The images in the base set can be simply encoded using their labels, and the query images can be encoded using the predicted labels obtained from the classifier. We denote this coding scheme as classification random coding (CRC).
With this coding scheme, there are only $C$ (the number of classes) non-empty buckets, and each bucket contains all the base images belonging to the corresponding class. In the search stage, if the required number of returns is smaller than the number of images in one class (which is usually the case), one only needs to locate one hash bucket (with Hamming distance 0 to the binary code of the query image). This means the locating time of CRC can almost be ignored. Moreover, if the classifier correctly predicts the class label of the query image, all the returned results will be relevant and the precision is 100%. If the classifier misclassifies the query image, the precision is 0. The average search precision is therefore the same as the accuracy of the classifier. If the classifier is good, CRC achieves a very good search precision in a very short amount of time.
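The CRC scheme can be sketched as follows. This is a minimal Python illustration under stated assumptions: classes are identified by integers starting at 0, binary codes are represented as strings of 0s and 1s, and the function names `crc_codebook`, `crc_encode` and `crc_precision` are hypothetical. The precision function relies on the assumption stated in the text that the number of requested returns does not exceed the size of one class.

```python
import random


def crc_codebook(n_classes, n_bits, seed=0):
    """Assign every class a distinct random n_bits-bit code."""
    if n_classes > 2 ** n_bits:
        raise ValueError("code length too short to give every class a unique code")
    rng = random.Random(seed)
    # Sampling without replacement guarantees that the integers are unique.
    integers = rng.sample(range(2 ** n_bits), n_classes)
    return {c: format(value, "0{}b".format(n_bits)) for c, value in enumerate(integers)}


def crc_encode(labels, codebook):
    """Encode images by their (true or predicted) class labels."""
    return [codebook[label] for label in labels]


def crc_precision(predicted_labels, true_labels):
    """Each query scores 1 if its predicted class is correct and 0 otherwise."""
    hits = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
    return hits / len(true_labels)


codebook = crc_codebook(n_classes=1000, n_bits=32)
base_codes = crc_encode([0, 0, 1], codebook)    # base images use true labels
query_codes = crc_encode([1], codebook)         # queries use predicted labels
```

Here `crc_precision` is simply the classification accuracy, which is why the search precision of CRC equals the accuracy of the classifier.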
Figure 4 shows the performance of all the compared hashing methods, including CRC, on the fully supervised Imagenet. We have 25,000 query images; please see Table 2 for the size of each set. All the hashing methods use 32-bit codes. The DPSH method [12] is missing from the figure. We tried all the combinations of the parameters, but the DPSH model failed to converge. This is probably because there are too many training images.
Again, 16-table LSH (or IsoH) is better than its single-table version. IsoH is slightly better than LSH in terms of Precision@10. In terms of Precision@100, the advantage of IsoH over LSH becomes clear, which is consistent with the finding in [1].
The improvement of 16-table IsoH over the two deep hashing methods is significant, which casts doubt on the conclusion in most of the deep hashing papers.
CRC achieves the best performance. It is interesting to find that the precisions of all the other hashing methods converge to the precision of the brute-force search, while the precision of CRC is significantly better than that of the brute-force search. The reason might be that the brute-force search behaves somewhat like a nearest neighbor classifier on deep features, while the precision of CRC equals the accuracy of a linear classifier on deep features.
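Table 3: Sizes of the Train, Val, Base and Query sets for Imagenet in the partially supervised setting.

Dataset    #Train   #Val    #Base     #Query
Imagenet   128116   25000   1281167   25000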
The images in Val and Query sets are from the validation split provided by the original dataset. The images in Base set are from the train split provided by the original dataset. The images in Train set are 10% randomly selected from the Base set. Thus, the images in Val, Query and Base sets are different from each other.
4.5.2 Partially supervised setting
This time we use a more realistic setting in which only 10% of the images in the base set are used as supervised training images. The size of each set is given in Table 3. For reproducibility, we release the learned 2048-dimensional features and the train/val/base/query splits (http://www.cad.zju.edu.cn/home/dengcai/Data/ANNS/ANNSData.html).
Figure 5 shows the performance of all the compared hashing methods on the partially supervised Imagenet. All the hashing methods use 32-bit codes.
We can clearly see the advantage of using the multiple tables trick for LSH and IsoH. In terms of Precision@100, the DLBH method (the best-performing deep hashing method among the three) does have some improvement over single-table LSH and IsoH. However, 16-table IsoH significantly outperforms DLBH.
Overall, if the search time is correctly measured, the multiple tables trick is used, and a large and complicated dataset is used, the claim in most of the deep hashing papers no longer holds.
When a large-scale and complicated dataset is used, the precisions of all the hashing methods converge to the precision of the brute-force search. It seems that the precision of the brute-force search is the upper bound for all the hashing methods. This never happens when a simple dataset is used or when the labels of all the base images are known.
4.5.3 Effect of the code length
Most of the existing deep hashing papers [27, 15, 31, 11, 33, 2, 12, 26, 16, 4, 19, 3, 28, 13, 23, 32] report that the performance of various hashing methods increases as the code length increases. In this subsection, we carefully re-examine this conclusion.
Figure 6 shows the performance of DLBH with different code lengths on the partially supervised Imagenet. Figure 6 (a) shows the precision-time curves, while Figure 6 (b) shows the curves of precision versus the number of retrieved samples. From Figure 6 (a), we can see that the best performance of DLBH is achieved when the code length is 32. As the code length further increases, the search time increases faster. Figure 6 (b) shows that the performance of DLBH consistently increases as the code length increases. This is simply because the time reported in Figure 6 (a) includes the locating time and the scanning time, while the number of retrieved samples in Figure 6 (b) can only reflect the scanning time. Ignoring the locating time is the reason that most of the deep hashing papers [27, 15, 31, 11, 33, 2, 12, 26, 16, 4, 19, 3, 28, 13, 23, 32] reached a wrong conclusion.
From Figure 6 (b), we can see that if the number of retrieved samples is set to 100 (the starting point of all the curves), DLBH (16 bits) reaches around 13% precision while DLBH (36 bits) reaches almost 34% precision. From Figure 6 (a), we can see that DLBH (16 bits) uses around 7.5s while DLBH (36 bits) uses around 200s. Since the number of retrieved samples is fixed at 100, the scanning times for the two cases are the same. It is the locating time that causes this difference of about 190s, i.e., the locating time (for locating 100 samples) of DLBH (36 bits) is about 20 times that of DLBH (16 bits). In other words, DLBH (36 bits) can find better candidates but needs a longer time than DLBH (16 bits).
Figure 7 provides an explanation. The first row of Figure 7 shows the number of queries (out of 25,000 in total) that successfully located 100 samples within different Hamming radii for DLBH with different code lengths. Figure 7 (a) shows that more than 18,000 queries retrieved 100 images successfully within Hamming radius 0 (i.e., more than 18,000 queries only need to visit one hash bucket) when DLBH uses 16 bits. In comparison, if DLBH uses 36 bits, Figure 7 (e) shows that fewer than 6,000 queries successfully retrieved 100 images (i.e., the remaining queries need to visit more hash buckets). The second row of Figure 7 shows that the number of hash buckets grows quickly as the code length increases when the radius is fixed.
To retrieve 100 images, the number of buckets that need to be located increases almost exponentially as the code length increases. Thus, the locating time increases almost exponentially as the code length increases.
5 Conclusion
Deep hashing methods have recently attracted a lot of research interest. The idea is attractive, but the effectiveness and efficiency of these methods need careful re-examination. Three important factors are missing in most of the previous deep hashing papers: 1) they fail to correctly measure the search time; 2) they compare with suboptimal versions of traditional hashing algorithms (which do not use the multiple tables trick); 3) they use very small and simple datasets for evaluation. Under a more realistic setting, the results are quite surprising: several representative and state-of-the-art deep hashing methods cannot outperform the traditional multi-table IsoH.
References
 [1] D. Cai. A revisit of hashing algorithms for approximate nearest neighbor search. arXiv preprint arXiv:1612.07545, 2016.
 [2] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen. Deep quantization network for efficient image retrieval. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 3457–3463, 2016.
 [3] Z. Cao, M. Long, J. Wang, and Q. Yang. Transitive hashing network for heterogeneous multimedia retrieval. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 81–87, 2017.
 [4] T.-T. Do, A.-D. Doan, and N.-M. Cheung. Learning to hash with binary deep neural network. In Proceedings of European Conference on Computer Vision, pages 219–234, 2016.
 [5] C. Fu, C. Wang, and D. Cai. Fast approximate nearest neighbor search with navigating spreading-out graphs. arXiv preprint arXiv:1707.00143, 2017.
 [6] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, 1999.
 [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [8] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, 1998.
 [9] W. Kong and W. Li. Isotropic hashing. In Advances in Neural Information Processing Systems 25, 2012.
 [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [11] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 3270–3278, 2015.
 [12] W.-J. Li, S. Wang, and W.-C. Kang. Feature learning based deep supervised hashing with pairwise labels. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1711–1717, 2016.
 [13] J. Lin, Z. Li, and J. Tang. Discriminative deep hashing for scalable face image retrieval. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 2266–2272, 2017.
 [14] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen. Deep learning of binary hash codes for fast image retrieval. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 27–35, 2015.
 [15] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2475–2483, 2015.
 [16] H. Liu, R. Wang, S. Shan, and X. Chen. Deep supervised hashing for fast image retrieval. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2064–2072, 2016.
 [17] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel. Performance measures for information extraction. In Proceedings of DARPA Broadcast News Workshop, 2000.
 [18] Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320, 2016.
 [19] Y. Mu and Z. Liu. Deep hashing: A joint approach for image signature learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 2380–2386, 2017.
 [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [22] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000.
 [23] S. Su, G. Chen, X. Cheng, and R. Bi. Deep supervised hashing with nonlinear projections. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 2786–2792, 2017.
 [24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [25] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep learning for contentbased image retrieval: A comprehensive study. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 157–166, 2014.
 [26] Z. Wang, Y. Yang, S. Chang, Q. Ling, and T. S. Huang. Learning a deep ℓ∞ encoder for hashing. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2174–2180, 2016.
 [27] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 2156–2162, 2014.
 [28] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao. Pairwise relationship guided deep hashing for cross-modal retrieval. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 1618–1625, 2017.
 [29] H.-F. Yang, K. Lin, and C.-S. Chen. Supervised learning of semantics-preserving hash via deep convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, (DOI 10.1109/TPAMI.2017.2666812), 2017.
 [30] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Transactions on Image Processing, 24(12):4766–4779, 2015.
 [31] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image retrieval. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1556–1564, 2015.
 [32] H. Zhu and S. Gao. Locality constrained deep supervised hashing for image retrieval. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 3567–3573, 2017.
 [33] H. Zhu, M. Long, J. Wang, and Y. Cao. Deep hashing network for efficient similarity retrieval. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2415–2421, 2016.