Query-adaptive Image Retrieval by Deep Weighted Hashing
Hashing methods have attracted much attention for large-scale image retrieval, and recent deep hashing methods have achieved promising results by exploiting the strong representation power of deep networks. However, existing deep hashing methods treat all hash bits equally. On one hand, because the Hamming distance is discrete, a large number of images share the same distance to a query image, which is a critical issue for image retrieval, where fine-grained rankings are very important. On the other hand, different hash bits contribute differently to image retrieval, so treating them equally hurts retrieval accuracy. To address these two problems, we propose the query-adaptive deep weighted hashing (QaDWH) approach, which performs fine-grained ranking for different queries by weighted Hamming distance. First, a novel deep hashing network is proposed to learn the hash codes and the corresponding class-wise weights jointly, so that the learned weights reflect the importance of different hash bits for different image classes. Second, a query-adaptive image retrieval method is proposed, which rapidly generates hash bit weights for each query image by fusing its semantic probability with the learned class-wise weights. Fine-grained image retrieval is then performed by the weighted Hamming distance, which provides more accurate ranking than the traditional Hamming distance. Experiments on four widely used datasets show that the proposed approach outperforms eight state-of-the-art hashing methods.
I Introduction
With the rapid growth of multimedia data on the web, retrieving relevant multimedia content from a massive database has become an urgent need, yet it remains a big challenge. Hashing methods map multimedia data into short binary codes to exploit the storage and computing efficiency of Hamming codes, and have therefore received increasing attention in many multimedia application scenarios, such as image retrieval [1, 2, 3, 4, 5, 6, 7, 8], video retrieval [9, 10] and cross-media retrieval [11, 12]. Generally speaking, hashing methods aim to learn mapping functions for multimedia data, so that similar data are mapped to similar binary codes. These hashing methods can be divided into two categories, namely unsupervised methods and supervised methods.
Unsupervised methods design the hash functions without using image labels, and they can be divided into data independent methods and data dependent methods. The representative data independent method is Locality Sensitive Hashing (LSH), which maps data into binary codes by random linear projections. There are several extensions of LSH, such as SIKH and Multi-probe LSH. Data dependent methods try to learn hash functions by analyzing data properties, such as manifold structures and data distributions. For example, Spectral Hashing (SH) is proposed to design hash codes that are balanced and uncorrelated. Anchor Graph Hashing (AGH) is proposed to use anchor graphs to discover neighborhood structures. Gong et al. propose Iterative Quantization (ITQ) to learn hash functions by minimizing the quantization error of mapping data to the vertices of a binary hypercube. Topology Preserving Hashing (TPH) is proposed to preserve the consistent neighborhood rankings of data points in Hamming space. Irie et al. propose Locally Linear Hashing (LLH) to utilize locality-sensitive sparse coding to capture the local linear structures and then recover these structures in Hamming space.
Unsupervised methods can search nearest neighbors under a certain kind of distance metric (e.g. Euclidean distance). However, the neighbors in feature space may not be semantically similar. Therefore, supervised hashing methods are proposed, which leverage semantic information to generate effective hash codes. Binary Reconstruction Embedding (BRE) is proposed to learn the hash functions by minimizing the reconstruction error between the original distances and the reconstructed distances in the Hamming space. Wang et al. propose Semi-supervised Hashing (SSH) to learn hash functions by minimizing the empirical error over the labeled data while maximizing the information entropy of the generated hash codes over both labeled and unlabeled data. Liu et al. propose Supervised Hashing with Kernels (KSH) to learn the hash codes by preserving the pairwise relationship between data samples provided by labels. Order Preserving Hashing (OPH) and Ranking Preserving Hashing (RPH) are proposed to learn hash functions by preserving the ranking information, which is obtained from the number of semantic labels shared between images. Supervised Discrete Hashing (SDH) is proposed to leverage label information to obtain hash codes by integrating hash code generation and classifier training.
Although the aforementioned unsupervised and supervised methods have achieved considerable progress, their image representations are hand-crafted features (e.g. GIST, Bag-of-Visual-Words), which cannot represent the semantic information of images well. Inspired by the successful applications of deep networks to image classification and object detection, some deep hashing methods have been proposed recently to take advantage of the superior feature representation power of deep networks. Convolutional Neural Network Hashing (CNNH) is a two-stage framework, which learns fixed hash codes in the first stage by preserving the pairwise semantic similarity, and learns the hash functions based on the learned hash codes in the second stage. Although the learned hash codes can guide the feature learning, the learned features cannot give feedback for learning better hash codes. To overcome the shortcomings of the two-stage learning scheme, some approaches have been proposed to learn image features and hash codes simultaneously. Lai et al. propose Network In Network Hashing (NINH) to use a triplet ranking loss to capture the relative similarities of images. NINH is a one-stage supervised method, so image representation learning and hash code learning can benefit each other in the deep architecture. Some similar ranking-based deep hashing methods [29, 30, 31] have been proposed recently, which are also designed to preserve the ranking information obtained from labels. Besides the triplet ranking based methods, some pairwise based deep hashing methods [32, 33] have also been proposed, which try to preserve the semantic similarities provided by pairwise labels.
Although deep hashing methods have achieved promising results on image retrieval, existing deep hashing methods [27, 2, 30] treat all hash bits equally. However, Hamming distances are discrete integers, so there are often a large number of images sharing the same Hamming distance to a query, which is a critical issue for image retrieval, where fine-grained rankings are very important. An example is illustrated in Figure 1: given a query image and its hash code, there can be 6 images with different hash codes that lie within Hamming radius 1 of the query image while differing from it in different hash bits. Existing deep hashing methods cannot perform fine-grained ranking among them. However, if we know which hash bits are more important for the query image, we can return a better ranking list.
There exist several traditional hashing methods [34, 35, 36, 37, 38, 39] that learn weights for hash codes. However, these methods adopt two-stage frameworks, which first generate hash codes by other methods (e.g. LSH, ITQ), then learn hash weights by analyzing the fixed hash codes and image features. In such a two-stage scheme the learned weights cannot give feedback for learning better hash codes, which limits the retrieval accuracy. We therefore propose the query-adaptive deep weighted hashing (QaDWH) method, which can not only learn hash codes and the corresponding class-wise weights jointly, but also perform effective yet efficient query-adaptive fine-grained retrieval. To the best of our knowledge, this is the first deep hashing method that can perform query-adaptive fine-grained ranking. The main contributions of this paper can be summarized as follows:
A novel deep hashing network is designed to learn hash functions and the corresponding weights jointly. In the proposed deep network, a hash layer and a class-wise weight layer are designed: the hash layer generates hash codes, while the class-wise weight layer learns the class-wise weights for different hash bits. On top of the hash stream, a weighted triplet ranking loss is proposed to preserve the similarity constraints. With the trained deep network, we can not only generate the binary hash codes, but also weigh the importance of each bit for different image classes.
A query-adaptive retrieval method is proposed to perform fine-grained retrieval. The bit-wise hash weights of a query image are first rapidly generated by fusing the learned class-wise weights and the predicted query class probability, so that the generated weights reflect the semantic property of each query. Based on the weights and the generated hash codes, the weighted Hamming distance is employed to perform fine-grained rankings for different query images.
Extensive experiments on four datasets show that the proposed approach achieves the best retrieval accuracy compared with eight state-of-the-art hashing methods.
The rest of this paper is organized as follows. Section II briefly reviews the related work, Section III presents the proposed deep weighted hashing method, Section IV shows the experiments on four widely used image datasets, and Section V concludes this paper.
II Related Work
In this section, we briefly review related work, including some deep hashing methods proposed recently, and traditional weighted hashing methods.
II-A Deep Hashing Methods
Convolutional Neural Network Hashing (CNNH) is the first deep hashing method based on convolutional neural networks (CNNs). CNNH is composed of two stages: a hash code learning stage and a hash function learning stage. Given a training image set $\mathcal{I} = \{I_1, I_2, \dots, I_n\}$, in the hash code learning stage (Stage 1), CNNH learns approximate hash codes for training images by optimizing the following loss function:
$$\min_{H} \left\| S - \frac{1}{q} H H^{T} \right\|_F^2 \tag{1}$$
where $\|\cdot\|_F$ denotes the Frobenius norm; $S \in \{-1, 1\}^{n \times n}$ denotes the semantic similarity of image pairs in $\mathcal{I}$, in which $S_{ij} = 1$ when images $I_i$ and $I_j$ are semantically similar, otherwise $S_{ij} = -1$; $H \in \{-1, 1\}^{n \times q}$ denotes the approximate hash codes. $H$ encodes the approximate hash codes for training images, which preserve the pairwise similarities in $S$. Equation (1) is difficult to optimize directly, so CNNH first relaxes the integer constraints on $H$ and randomly initializes $H \in [-1, 1]^{n \times q}$, then optimizes equation (1) using a coordinate descent algorithm, which sequentially or randomly chooses one entry in $H$ to update while keeping the other entries fixed. This is equivalent to optimizing the following equation:
$$\min_{H_{\cdot j}} \left\| H_{\cdot j} H_{\cdot j}^{T} - \Big( qS - \sum_{c \neq j} H_{\cdot c} H_{\cdot c}^{T} \Big) \right\|_F^2 \tag{2}$$
where $H_{\cdot j}$ and $H_{\cdot c}$ denote the $j$-th and the $c$-th column of $H$ respectively. In the hash function learning stage (Stage 2), CNNH uses deep networks to learn image features and hash functions. Specifically, CNNH adopts an existing deep framework as its basic network, and designs an output layer with sigmoid activation to generate $q$-bit hash codes. CNNH trains the designed deep network in a supervised way, in which the approximate hash codes learned in Stage 1 are used as the ground truth. However, CNNH is a two-stage framework, where the deep features learned in Stage 2 cannot help to optimize the approximate hash code learning in Stage 1, which limits the performance of hash learning.
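As a concrete reading of the Stage 1 objective, the following minimal Python sketch evaluates a loss of the form $\| S - \frac{1}{q} H H^T \|_F^2$ on toy data. The function name `cnnh_stage1_loss`, the toy similarity matrix and the 2-bit codes are illustrative assumptions and are not taken from the CNNH implementation.

```python
def cnnh_stage1_loss(S, H):
    """Squared Frobenius norm of S - (1/q) * H * H^T for list-of-lists inputs."""
    q = len(H[0])  # code length
    loss = 0.0
    for i, h_i in enumerate(H):
        for j, h_j in enumerate(H):
            inner = sum(a * b for a, b in zip(h_i, h_j)) / q
            loss += (S[i][j] - inner) ** 2
    return loss


# Toy data (assumed): images 0 and 1 are similar, image 2 is dissimilar to both.
S = [[1, 1, -1],
     [1, 1, -1],
     [-1, -1, 1]]
perfect_codes = [[1, 1], [1, 1], [-1, -1]]      # reproduces S exactly
imperfect_codes = [[1, 1], [1, -1], [-1, -1]]   # image 1 sits between the others

perfect_loss = cnnh_stage1_loss(S, perfect_codes)
imperfect_loss = cnnh_stage1_loss(S, imperfect_codes)
```

Codes whose normalized inner products reproduce the similarity matrix give zero loss, while any disagreement between code similarity and label similarity is penalized quadratically.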
Different from the two-stage framework in CNNH, Network in Network Hashing (NINH) performs image representation learning and hash code learning jointly. NINH constructs its deep framework based on the Network in Network architecture, with a shared sub-network composed of several stacked convolutional layers to extract image features, as well as a divide-and-encode module followed by a sigmoid activation function and a piece-wise threshold function to output binary hash codes. During the learning process, instead of generating approximate hash codes in advance, NINH utilizes a triplet ranking loss function to exploit the relative similarity of training images to directly guide hash learning:
$$l_{triplet}(I, I^{+}, I^{-}) = \max\big(0,\; 1 - (\|b(I) - b(I^{-})\|_H - \|b(I) - b(I^{+})\|_H)\big), \quad \text{s.t. } b(I), b(I^{+}), b(I^{-}) \in \{0, 1\}^{q} \tag{3}$$
where $I$, $I^{+}$ and $I^{-}$ specify the triplet constraint that image $I$ is more similar to image $I^{+}$ than to image $I^{-}$ based on image labels; $b(\cdot)$ denotes the binary hash code, and $\|\cdot\|_H$ denotes the Hamming distance. For easy optimization of equation (3), NINH applies two relaxation tricks: relaxing the integer constraint on the binary hash code and replacing the Hamming distance with the Euclidean distance.
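The triplet ranking loss and its relaxation can be illustrated with a few lines of Python. This is a minimal sketch, not NINH's implementation: the helper names, the toy codes, the margin of 1 and the use of the squared Euclidean distance as the relaxed surrogate are assumptions made for the example.

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary codes."""
    return sum(x != y for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Relaxed surrogate distance for real-valued codes (assumed form)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_ranking_loss(anchor, positive, negative, margin=1.0, dist=hamming):
    """max(0, margin - (d(anchor, negative) - d(anchor, positive)))."""
    return max(0.0, margin - (dist(anchor, negative) - dist(anchor, positive)))


anchor, positive, negative = [1, 0, 1, 1], [1, 0, 1, 0], [0, 1, 1, 1]

# The positive is 1 bit away and the negative 2 bits away: the constraint holds.
satisfied = triplet_ranking_loss(anchor, positive, negative)
# Swapping the roles violates the constraint and yields a positive loss.
violated = triplet_ranking_loss(anchor, negative, positive)
# Same loss on real-valued codes with the relaxed distance.
relaxed = triplet_ranking_loss([0.9, 0.1, 0.8, 0.7], [0.8, 0.2, 0.9, 0.4],
                               [0.2, 0.9, 0.7, 0.6], dist=squared_euclidean)
```

The relaxed version is differentiable almost everywhere in the code values, which is what makes gradient-based training possible.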
There are several extensions based on NINH. The Bit-scalable Deep Hashing method further manipulates the hash code length by weighing each bit of the hash codes. Deep Hashing Network (DHN) additionally minimizes the quantization errors besides the triplet ranking loss to improve retrieval precision. Deep semantic preserving and ranking-based hashing (DSRH) introduces orthogonal constraints into the triplet ranking loss to make hash bits independent. The above deep hashing methods treat all hash bits equally, which leads to a coarse ranking among images with the same Hamming distance and limits retrieval accuracy.
II-B Traditional Weighted Hashing Methods
Hamming distances are discrete integers, so they cannot provide fine-grained ranking for images that share the same distance to a query image. Some hash bit weighting methods [34, 35, 36, 37, 38, 39] are proposed to address this issue. QaRank [35, 36] first learns class-specific weights by minimizing the intra-class similarity and maintaining the inter-class relations, then generates query-adaptive weights by using the labels of the top k similar images. QsRank designs a ranking algorithm for PCA-based hashing methods, which uses the probability that the $\epsilon$-neighbors of a query are mapped to a given hash code to measure the ranking score of that hash code. WhRank proposes a weighted Hamming distance ranking algorithm based on data-adaptive weights and query-sensitive bitwise weights. QRank [34, 37] learns query-adaptive weights by exploiting both the discriminative power of each hash function and their complement for nearest neighbor search. The aforementioned traditional weighted hashing methods are all two-stage schemes, which take hash codes generated by other methods (such as LSH, SH and ITQ) as input, then learn the weights by analyzing the fixed hash codes and image features. In such a two-stage scheme the learned weights cannot give feedback for learning better hash codes, which limits the retrieval accuracy.
III Query-adaptive Deep Weighted Hashing
Given a set of $n$ images $\mathcal{X} = \{x_1, \dots, x_n\}$, the goal of hashing methods is to learn a mapping function $\mathcal{H}: \mathcal{X} \rightarrow \{0, 1\}^{q}$, which encodes an image into a $q$-dimensional binary code in the Hamming space while preserving the semantic similarity of images. In this section, we introduce the proposed query-adaptive deep weighted hashing (QaDWH) approach. The overall framework is shown in Figure 2. The proposed deep hashing network consists of two streams, namely the hash stream and the classification stream. In the training stage, the hash stream learns the hash functions and the associated weights, while the classification stream preserves the semantic information. In the query stage, the trained network generates compact hash codes and bit-wise weights for newly input query images, and fine-grained ranking can then be performed efficiently by the weighted Hamming distance. In the rest of this section, we first introduce the proposed deep hashing network and training algorithm, and then present the query-adaptive image retrieval method.
III-A Deep Weighted Hashing Network
As shown in Figure 2, the proposed deep network is composed of the representation learning layers, the hash stream and the classification stream. The representation learning layers serve as a feature extractor, which is a deep network composed of several convolutional layers and fully connected layers. We adopt the VGG-19 network as the representation learning layers, in which the first 18 layers follow exactly the same settings as in the VGG-19 network. The hash stream and the classification stream are both connected to the representation learning layers.
III-A1 The Hash Stream
The hash stream is composed of two layers, the hash code learning layer and the class-wise weight layer. The hash code learning layer is a fully connected layer with $q$ neural nodes, and its outputs are hash codes defined as:
$$h(x) = \mathrm{sigmoid}\big(W^{T} f(x) + v\big) \tag{4}$$
where $f(x)$ is the deep features extracted from the representation learning layers, and $W$ and $v$ are the parameters in the hash code learning layer. Through the hash code learning layer, the image features $f(x)$ are mapped into $[0, 1]^{q}$. Since the hash codes $h(x) \in [0, 1]^{q}$ are continuous real values, we apply a thresholding function to obtain binary codes:
$$b_k(x) = \frac{\mathrm{sgn}\big(h_k(x) - 0.5\big) + 1}{2}, \quad k = 1, \dots, q \tag{5}$$
In order to learn the class-wise hash weights, we design a class-wise weight layer connected to the hash code learning layer. The class-wise weight layer is an element-wise layer, which is also associated with image classes. Suppose the number of image classes is $c$ and the hash code length is $q$; then the class-wise weight layer is defined as an element-wise layer with parameters $W_c \in \mathbb{R}^{c \times q}$, whose $j$-th row $w_j$ holds the weights of class $j$. The output of the class-wise weight layer is defined as:
$$h_w(x_i) = h(x_i) \odot w_{c_i} \tag{6}$$
where $h(x_i)$ is the output hash codes of $x_i$, $c_i$ is the image class index of $x_i$, and $\odot$ denotes the element-wise product. Here we constrain the weights to be nonnegative. For training images with multiple classes, we use the average of the corresponding weights in the element-wise product. Through the class-wise weight layer, the hash codes of each image are multiplied by the weights associated with its image class.
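The two layers of the hash stream can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' Caffe layers: it assumes a sigmoid activation in the hash layer, a threshold at 0.5, and toy parameter values; all function and variable names are hypothetical.

```python
import math


def hash_layer(features, W_cols, v):
    """Real-valued codes h = sigmoid(W^T f + v); W_cols[k] is the weight vector of bit k."""
    return [1.0 / (1.0 + math.exp(-(sum(w * f for w, f in zip(col, features)) + b)))
            for col, b in zip(W_cols, v)]


def binarize(h):
    """Threshold real-valued codes at 0.5 to obtain bits in {0, 1}."""
    return [1 if x > 0.5 else 0 for x in h]


def fused_class_weights(labels, class_weights):
    """Average the weight rows of every class the image belongs to."""
    rows = [class_weights[j] for j in labels]
    return [sum(col) / len(rows) for col in zip(*rows)]


def weight_layer(h, labels, class_weights):
    """Element-wise product of the codes with the (fused) class-wise weights."""
    w = fused_class_weights(labels, class_weights)
    return [hk * wk for hk, wk in zip(h, w)]


# Toy setup (assumed): 2-d features, q = 3 bits, c = 2 classes.
features = [1.0, -1.0]
W_cols = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
v = [0.0, 0.0, 0.5]
class_weights = [[1.0, 2.0, 0.5],   # nonnegative weights of class 0
                 [3.0, 0.0, 1.5]]   # nonnegative weights of class 1

h = hash_layer(features, W_cols, v)
bits = binarize(h)
single_label_out = weight_layer(h, [1], class_weights)
multi_label_out = weight_layer(h, [0, 1], class_weights)
```

For a multi-label image the rows of its classes are averaged before the element-wise product, which matches the average fusion described in the text.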
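On top of the class-wise weight layer, we propose a weighted triplet ranking loss to train the hash stream. Consider the training images $\{(x_i, y_i)\}$, where $y_i$ are the corresponding image labels. We sample a set of triplet tuples depending on the labels, $T = \{(x_i, x_i^{+}, x_i^{-})\}_{i=1}^{t}$, in which $x_i$ and $x_i^{+}$ are two similar images with the same labels, while $x_i$ and $x_i^{-}$ are two dissimilar images with different labels, and $t$ is the number of sampled triplet tuples. For the triplet tuple $(x_i, x_i^{+}, x_i^{-})$, the weighted triplet ranking loss is defined as:
$$l_w(x_i, x_i^{+}, x_i^{-}) = \max\big(0,\; m_t + \|b(x_i) - b(x_i^{+})\|_{wH} - \|b(x_i) - b(x_i^{-})\|_{wH}\big) \tag{7}$$
where the constant parameter $m_t$ defines the margin between the relative similarity of the two pairs $(x_i, x_i^{+})$ and $(x_i, x_i^{-})$. That is to say, we expect the distance of the dissimilar pair $(x_i, x_i^{-})$ to be larger than the distance of the similar pair $(x_i, x_i^{+})$ by at least $m_t$. $\|\cdot\|_{wH}$ denotes the weighted Hamming distance defined as:
$$\|b(x_i) - b(x_j)\|_{wH} = \sum_{k=1}^{q} w_{c_i, k}\, \big|b_k(x_i) - b_k(x_j)\big| \tag{8}$$
where $c_i$ is the class index of image $x_i$ and $w_{c_i, k}$ is the weight of the $k$-th bit for class $c_i$. Note that in the weighted triplet ranking loss, the weights of the anchor point $x_i$ are used to calculate the weighted Hamming distance. Because the anchor point acts like a query in the retrieval process, we give priority to the anchor point's weights. Minimizing $l_w$ preserves the semantic ranking constraints provided by labels.
In equation (7), the binary hash codes and the Hamming distance make direct optimization hard. Similar to NINH, the binary hash code $b(x)$ is relaxed to the continuous real-valued hash code $h(x)$, and the Hamming distance is replaced by a weighted Euclidean distance defined as:
$$\|h(x_i) - h(x_j)\|_{w}^{2} = \sum_{k=1}^{q} w_{c_i, k}\, \big(h_k(x_i) - h_k(x_j)\big)^{2} \tag{9}$$
Then equation (7) can be rewritten as:
$$l_w(x_i, x_i^{+}, x_i^{-}) = \max\big(0,\; m_t + \|h(x_i) - h(x_i^{+})\|_{w}^{2} - \|h(x_i) - h(x_i^{-})\|_{w}^{2}\big) \tag{10}$$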
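A small Python sketch makes the effect of the anchor's weights on the relaxed loss visible. It assumes the relaxed distance has the form $\sum_k w_k (h_k - h'_k)^2$ with $w$ taken from the anchor's class, and it uses a margin of 1.0 purely for illustration; the names and toy values are hypothetical.

```python
def weighted_sq_distance(h_a, h_b, w):
    """sum_k w[k] * (h_a[k] - h_b[k])**2, with w taken from the anchor's class."""
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, h_a, h_b))


def weighted_triplet_loss(h, h_pos, h_neg, w_anchor, margin):
    """Hinge loss asking d(anchor, negative) >= d(anchor, positive) + margin."""
    d_pos = weighted_sq_distance(h, h_pos, w_anchor)
    d_neg = weighted_sq_distance(h, h_neg, w_anchor)
    return max(0.0, margin + d_pos - d_neg)


# Toy codes (assumed): the positive differs in bit 1, the negative in bit 0.
h, h_pos, h_neg = [1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 0.0, 1.0]

# With uniform weights both neighbours are equally far, so the margin is violated.
uniform_loss = weighted_triplet_loss(h, h_pos, h_neg, [1.0, 1.0, 1.0], margin=1.0)
# Up-weighting bit 0 and down-weighting bit 1 separates them and removes the loss.
weighted_loss = weighted_triplet_loss(h, h_pos, h_neg, [2.0, 0.5, 1.0], margin=1.0)
```

In this toy case the codes are fixed and only the weights change, which shows how learning class-wise weights can satisfy ranking constraints that an unweighted distance cannot.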
III-A2 The Classification Stream
Besides the hash stream, we also design a classification stream connected to the representation learning layers. On one hand, jointly training the hash stream and the classification stream can improve the retrieval accuracy, which has been shown in previous work. On the other hand, the trained classification stream can be used to generate the query-adaptive hash code weights, which will be introduced in the next part. In the classification stream, a fully connected layer with $c$ neural nodes is connected to the representation learning layers, which predicts the probability of each class. Then a softmax loss is used to train the classification stream:
$$L_s = -\frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{c} y_{nj} \log \frac{e^{\theta_j^{T} f(x_n)}}{\sum_{l=1}^{c} e^{\theta_l^{T} f(x_n)}} \tag{11}$$
where $\theta$ are parameters of the network, $N$ is the number of images in one batch, and $y_{nj} \in \{0, 1\}$ denotes whether image $x_n$ belongs to class $j$. Note that this is not a standard softmax loss, but a multilabel softmax loss, which can handle images with multiple labels. When only one element of $y_n$ is 1, the above equation is equal to the standard softmax loss. By incorporating the hash stream and the classification stream, the network can not only preserve the ranking information and semantic information, but also learn the bit-wise hash weights for different image classes.
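The multilabel softmax loss can be sketched as below. This is an illustration under one assumed form of the loss, $-\frac{1}{N}\sum_n \sum_j y_{nj} \log p_{nj}$ with 0/1 label vectors; the function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multilabel_softmax_loss(batch_logits, batch_labels):
    """-(1/N) * sum_n sum_j y[n][j] * log p[n][j], with y[n] a 0/1 label vector."""
    total = 0.0
    for logits, y in zip(batch_logits, batch_labels):
        p = softmax(logits)
        total -= sum(yj * math.log(pj) for yj, pj in zip(y, p))
    return total / len(batch_logits)


# One label per image: identical to the standard softmax (cross-entropy) loss.
single = multilabel_softmax_loss([[0.0, 0.0]], [[1, 0]])
# Two labels on the same image: both log-probabilities contribute.
multi = multilabel_softmax_loss([[0.0, 0.0]], [[1, 1]])
```

With a one-hot label vector the inner sum keeps a single term, which is the reduction to the standard softmax loss mentioned in the text.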
III-A3 Network Training
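Forward and backward propagation schemes are used in the training phase. For the two streams in the network, we use a co-training method to tune the network parameters jointly. More specifically, in the forward propagation stage, the ranking error in the hash stream is measured by equation (10), and the classification error in the classification stream is measured by equation (11). Then in the backward propagation stage, the network parameters are tuned by the gradient of each loss function. For the weighted triplet based ranking loss, the gradients with respect to $h(x_i)$, $h(x_i^{+})$ and $h(x_i^{-})$ are computed as:
$$\frac{\partial l_w}{\partial h(x_i)} = 2\, w_{c_i} \odot \big(h(x_i^{-}) - h(x_i^{+})\big)\, I_{l_w > 0}, \quad \frac{\partial l_w}{\partial h(x_i^{+})} = 2\, w_{c_i} \odot \big(h(x_i^{+}) - h(x_i)\big)\, I_{l_w > 0}, \quad \frac{\partial l_w}{\partial h(x_i^{-})} = 2\, w_{c_i} \odot \big(h(x_i) - h(x_i^{-})\big)\, I_{l_w > 0} \tag{12}$$
where $w_{c_i}$ is the weight vector of the anchor's class and $I_{l_w > 0}$ is an indicator function, which is 1 if $l_w > 0$ is true and 0 otherwise. Then the gradient of each image is fed into the network to update the parameters of each layer, including the hash layer and the weight layer.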
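Gradients of a hinge-style weighted triplet loss are easy to get wrong by a sign, so a numerical check is a useful companion. The sketch below assumes the relaxed distance $\sum_k w_k (h_k - h'_k)^2$ with the anchor's weights and a margin of 1.0; the closed-form gradients in `analytic_grads` follow from that assumed form, and all names and values are illustrative.

```python
def loss(h, hp, hn, w, m):
    """Relaxed weighted triplet loss under the assumed distance form."""
    d_pos = sum(wk * (a - b) ** 2 for wk, a, b in zip(w, h, hp))
    d_neg = sum(wk * (a - b) ** 2 for wk, a, b in zip(w, h, hn))
    return max(0.0, m + d_pos - d_neg)


def analytic_grads(h, hp, hn, w, m):
    """Closed-form gradients w.r.t. anchor, positive and negative codes."""
    active = 1.0 if loss(h, hp, hn, w, m) > 0 else 0.0  # hinge indicator
    g_h = [2 * wk * (n - p) * active for wk, p, n in zip(w, hp, hn)]
    g_hp = [2 * wk * (p - a) * active for wk, a, p in zip(w, h, hp)]
    g_hn = [2 * wk * (a - n) * active for wk, a, n in zip(w, h, hn)]
    return g_h, g_hp, g_hn


def numeric_grad(f, x, eps=1e-6):
    """Central finite differences of a scalar function f at the list x."""
    grads = []
    for k in range(len(x)):
        up = x[:k] + [x[k] + eps] + x[k + 1:]
        down = x[:k] + [x[k] - eps] + x[k + 1:]
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


h, hp, hn = [0.9, 0.2, 0.6], [0.7, 0.4, 0.5], [0.3, 0.1, 0.8]
w, m = [1.0, 0.5, 2.0], 1.0

g_h, g_hp, g_hn = analytic_grads(h, hp, hn, w, m)
n_h = numeric_grad(lambda x: loss(x, hp, hn, w, m), h)
n_hp = numeric_grad(lambda x: loss(h, x, hn, w, m), hp)
n_hn = numeric_grad(lambda x: loss(h, hp, x, w, m), hn)
```

When the hinge is inactive (the loss is zero), all three gradients vanish, so such triplets do not update the network.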
For the softmax loss, the gradient with respect to $\theta_j$ is calculated as:
$$\frac{\partial L_s}{\partial \theta_j} = -\frac{1}{N} \sum_{n=1}^{N} f(x_n) \Big( y_{nj} - p_{nj} \sum_{l=1}^{c} y_{nl} \Big) \tag{13}$$
where $p_{nj}$ is the softmax probability of class $j$ for image $x_n$ in equation (11). By equations (12) and (13), these derivative values can be fed into the network via the back-propagation algorithm to update the parameters of each layer in the deep network. Training ends when the loss converges or a predefined maximal iteration number is reached. We briefly summarize the training process in Algorithm 1. Note that after the network is trained, we obtain not only the hash mapping functions, but also the hash weights associated with each bit.
III-B Query-adaptive Image Retrieval
In the query stage, existing deep hashing methods [27, 2, 30] treat each hash bit equally: they usually first map the query image to binary hash codes and then retrieve the images in the database by Hamming distance. However, Hamming distances are discrete values, which cannot provide fine-grained ranking since a large number of images may share the same distance to a query image. To address this issue, we propose the query-adaptive image retrieval approach. For a given query image $\tilde{x}$, we first generate real-valued hash codes $h(\tilde{x})$ by the output of the hash layer, then the binary codes $b(\tilde{x})$ are generated by equation (5).
In order to perform query-adaptive fine-grained ranking, besides the hash codes, we also generate query-adaptive hash weights efficiently. From the trained network, we already have the class-wise hash bit weights $W_c$ (with rows $w_j$) for different image classes. The query-adaptive weights are generated rapidly as:
$$\tilde{w} = p(\tilde{x})^{T} W_c = \sum_{j=1}^{c} p_j(\tilde{x})\, w_j \tag{14}$$
where $p(\tilde{x})$ is the predicted probability generated by the classification stream, in which $p_j(\tilde{x})$ indicates the probability that $\tilde{x}$ belongs to image class $j$. Equation (14) means that we fuse the class-wise weights by the probability that the query belongs to each class, so the generated hash bit weights reflect the semantic property of the query image. Fine-grained image ranking can be performed by the weighted Hamming distance between the query $\tilde{x}$ and any image $x_i$ in the database:
$$D(\tilde{x}, x_i) = \sum_{k=1}^{q} \tilde{w}_k\, \big|b_k(\tilde{x}) - b_k(x_i)\big| \tag{15}$$
where $q$ is the length of hash codes. We summarize the query-adaptive retrieval method in Algorithm 2. Note that the proposed query-adaptive image retrieval method is also very fast compared to the original Hamming distance measurement. Equation (14) is a simple matrix multiplication, which is efficient. In practice, the weighted Hamming distance only needs to be computed on a subset of hash codes, since we can first sort by Hamming distance using fast bitwise operations, and then compute the weighted distance only for the subset within a small Hamming distance of the query. Thus the additional computation is very small compared to the original Hamming distance ranking, and effective yet efficient fine-grained ranking can be performed.
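The query stage described above can be sketched end to end in Python: fuse the class-wise weights with the predicted class probabilities, keep only database codes within a small Hamming radius, and re-rank that subset by weighted Hamming distance. This is a minimal sketch, not the authors' implementation; codes are stored as Python integers, the radius value is an assumption, and all names and toy values are hypothetical.

```python
def fuse_weights(class_probs, class_weights):
    """Query weights as the probability-weighted sum of class-wise weight rows."""
    q = len(class_weights[0])
    return [sum(p * row[k] for p, row in zip(class_probs, class_weights))
            for k in range(q)]


def hamming(code_a, code_b):
    """Hamming distance of two integer-encoded codes via XOR and bit counting."""
    return bin(code_a ^ code_b).count("1")


def weighted_hamming(code_a, code_b, weights):
    """Sum of the weights of the differing bits; bit k is the k-th least significant bit."""
    diff = code_a ^ code_b
    return sum(w for k, w in enumerate(weights) if (diff >> k) & 1)


def query_adaptive_rank(query_code, class_probs, class_weights, database, radius=2):
    """Filter by plain Hamming distance, then sort candidates by weighted distance."""
    weights = fuse_weights(class_probs, class_weights)
    candidates = [i for i, code in enumerate(database)
                  if hamming(query_code, code) <= radius]
    return sorted(candidates,
                  key=lambda i: weighted_hamming(query_code, database[i], weights))


# Toy data (assumed): 4-bit codes, 2 classes.
database = [0b0001, 0b0010, 0b0100, 0b1111, 0b0000]
class_weights = [[1.0, 3.0, 2.0, 1.0],
                 [2.0, 1.0, 0.5, 1.0]]

rank_class0 = query_adaptive_rank(0b0000, [1.0, 0.0], class_weights, database, radius=1)
rank_class1 = query_adaptive_rank(0b0000, [0.0, 1.0], class_weights, database, radius=1)
```

The three codes at Hamming distance 1 are tied under the plain Hamming distance, while the two queries order them differently because their predicted classes put different weights on the differing bits.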
IV Experiments
In this section, we introduce our experiments conducted on four widely used datasets: CIFAR10, NUS-WIDE, MIRFLICKR and ImageNet. We compare with eight state-of-the-art methods in terms of retrieval accuracy and efficiency to verify the effectiveness of our QaDWH approach. In addition, we conduct baseline experiments to verify the separate contributions of the proposed deep weighted hashing and query-adaptive retrieval.
IV-A Datasets and Experimental Settings
We conduct experiments on four widely used image retrieval datasets. Each dataset is split into query, database and training sets. We summarize the split of each dataset in Table I, and the detailed settings are as follows:
CIFAR10 dataset consists of 60,000 color images from 10 classes, each of which has 6,000 images. Following [2, 27], 1,000 images are randomly selected as the query set (100 images per class). For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 5,000 images (500 images per class) are further randomly selected from the remaining images to form the training set.
NUS-WIDE dataset contains nearly 270,000 images, and each image is associated with one or multiple labels from 81 semantic concepts. Following [2, 27], only the 21 most frequent concepts are used, where each concept has at least 5,000 images, resulting in a total of 166,047 images. 2,100 images are randomly selected as the query set (100 images per concept). For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 500 images from each of the 21 concepts are randomly selected to form a training set of 10,500 images in total.
MIRFLICKR dataset consists of 25,000 images collected from Flickr, and each image is associated with one or multiple labels of 38 semantic concepts. 1,000 images are randomly selected as the query set. For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 5,000 images are randomly selected from the remaining images to form the training set.
ImageNet dataset contains 1,000 categories with 1.2 million images. ImageNet is a large dataset that can comprehensively evaluate the proposed approach and the compared methods. Since the testing set of ImageNet is not publicly available, following previous work, we use the provided training set as the retrieval database, and the validation set as the query set. As the training set of each hashing method, we further randomly sample 50,000 images from the retrieval database (50 images per class).
IV-B Evaluation Metrics and Compared Methods
To objectively and comprehensively evaluate the retrieval accuracy of the proposed approach and the compared methods, we use 5 evaluation metrics: Mean Average Precision (MAP), Precision-Recall curves, precision curves of top k retrieved samples, precision within top 500 retrieved samples and precision within Hamming radius 2. These evaluation metrics are defined as follows:
Mean Average Precision (MAP): MAP presents an overall measurement of the retrieval performance. MAP for a set of queries is the mean of the average precision (AP) of each query, where AP is defined as:
$$AP = \frac{1}{R} \sum_{k=1}^{n} \frac{R_k}{k} \times rel_k \tag{16}$$
where $n$ is the size of the database set, $R$ is the number of images in the database set that are relevant to the query, $R_k$ is the number of relevant images in the top $k$ returns, and $rel_k = 1$ if the image ranked at the $k$-th position is relevant and 0 otherwise.
Precision-Recall curves: The precision at given levels of recall, calculated over all returned results.
Precision curves of top k retrieved samples: The average precision of the top k returned images for each query.
Precision within top 500 retrieved samples: The average precision of the top 500 returned images for each query.
Precision within Hamming radius 2: The precision of the returned images whose Hamming distance to the query is at most 2, using hash lookup.
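Two of the metrics above are sketched in Python below: average precision (and its mean over queries), and precision within a Hamming radius. This is a minimal sketch under stated assumptions: relevance is given as a 0/1 list in ranked order, images carry a single label, codes are stored as integers, and a query with no relevant image or no returned image scores 0, which is a convention chosen for the example. All names are hypothetical.

```python
def average_precision(relevance):
    """AP = (1/R) * sum_k (R_k / k) * rel_k over a ranked 0/1 relevance list."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0  # assumed convention for queries without relevant images
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1          # R_k: relevant images among the top k
            total += hits / k
    return total / num_relevant


def mean_average_precision(relevance_lists):
    """Mean of the per-query average precision."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


def precision_within_radius(query_code, query_label, db_codes, db_labels, radius=2):
    """Fraction of relevant images among those within the given Hamming radius."""
    returned = [label for code, label in zip(db_codes, db_labels)
                if bin(query_code ^ code).count("1") <= radius]
    if not returned:
        return 0.0  # assumed convention when the lookup returns nothing
    return sum(label == query_label for label in returned) / len(returned)


ap = average_precision([1, 0, 1, 0])                    # (1/1 + 2/3) / 2
map_score = mean_average_precision([[1, 0, 1, 0], [1, 1]])
p_r2 = precision_within_radius(0b1010, "cat",
                               [0b1010, 0b1000, 0b0101, 0b1111],
                               ["cat", "dog", "cat", "cat"])
```

In the last call the codes lie at Hamming distances 0, 1, 4 and 2 from the query, so three images are returned and two of them share the query's label.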
We compare the proposed QaDWH approach with eight state-of-the-art methods, including the unsupervised methods LSH, SH and ITQ, the supervised methods SDH, CNNH, NINH and DRSCH, and the traditional query-adaptive hashing method QRank. Brief introductions of these 8 methods are listed below:
LSH is a data independent unsupervised method, which uses randomly generated hash functions to map image features into binary codes.
SH is a data dependent unsupervised method, which learns hash functions by making hash codes balanced and uncorrelated.
ITQ is also a data dependent unsupervised method, which learns hash functions by minimizing the quantization error of mapping data to the vertices of a binary hypercube.
SDH is a supervised method, which leverages label information to obtain hash codes by integrating hash code generation and classifier training.
CNNH is a two-stage deep hashing method, which learns hash codes for training images in the first stage, and trains a deep hashing network in the second stage.
NINH is a one-stage deep hashing method, which learns a deep hashing network with a triplet loss function that measures the ranking information provided by labels.
DRSCH is also a triplet loss based deep hashing method, which can further manipulate the hash code length by weighing each bit of the hash codes.
QRank is a traditional query-adaptive hashing method, which learns query-adaptive hash weights by exploiting both the discriminative power of each hash function and their complement for nearest neighbor search. QRank is a state-of-the-art query-adaptive hashing method, which outperforms other weighted hashing methods (e.g. QsRank, WhRank).
IV-C Implementation Details
We implement the proposed approach based on the open-source framework Caffe. The parameters of the first 18 layers in our network are initialized with the VGG-19 network, which is pre-trained on the ImageNet dataset. A similar initialization strategy has been used in other deep hashing methods [29, 33]. For the weight layer, we initialize all weights to the same value, because we treat each bit equally at the beginning of training. In all experiments, our network is trained with an initial learning rate of 0.001, and we decrease the learning rate by a factor of 10 every 20,000 steps. The mini-batch size is 64 and the weight decay parameter is 0.0005. The margin $m_t$, the only parameter in our proposed loss function, is set to the same value in all the experiments.
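The learning rate schedule described above (start at 0.001 and divide by 10 every 20,000 steps) can be written as a one-line function. This is a sketch of the usual step decay rule; the function and parameter names are illustrative and are not quoted from any framework.

```python
def learning_rate(step, base_lr=0.001, gamma=0.1, step_size=20000):
    """Step decay: multiply the base rate by gamma once per completed step_size steps."""
    return base_lr * gamma ** (step // step_size)


lr_start = learning_rate(0)
lr_before_drop = learning_rate(19999)
lr_after_first_drop = learning_rate(20000)
lr_after_second_drop = learning_rate(40000)
```

The rate stays constant inside each 20,000-step window and drops by a factor of 10 at each window boundary.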
For the proposed QaDWH and the compared methods CNNH, NINH and DRSCH, we use raw image pixels as input. The implementations of CNNH and DRSCH are provided by their authors, while NINH is our own implementation. Since the representation learning layers of CNNH, NINH and DRSCH differ from each other, for a fair comparison we use the same VGG-19 network as the base structure for all deep hashing methods, and the network parameters of all the deep hashing methods are initialized with the same pre-trained VGG-19 model. The results of these VGG-19 based variants are referred to as CNNH, NINH and DRSCH respectively.
The query-adaptive method QRank uses image features and hash codes generated by other hashing methods as input. In order to compare QRank with the proposed QaDWH approach fairly, we use the hash codes and features generated by the deep hashing method NINH as the input of QRank, and denote the result as NINH-QRank. The implementation of QRank is provided by its author.
For the other compared traditional methods without deep networks, we represent each image by hand-crafted features and by deep features respectively. For hand-crafted features, we represent images in CIFAR10 and MIRFLICKR by 512-dimensional GIST features, and images in NUS-WIDE by 500-dimensional bag-of-words features. For a fair comparison between traditional methods and deep hashing methods, we also conduct experiments on the traditional methods with features extracted from deep networks, where we extract a 4096-dimensional deep feature for each image from the pre-trained VGG-19 network. We denote the results of traditional methods using deep features by LSH-VGG19, SH-VGG19, ITQ-VGG19 and SDH-VGG19. The results of SDH, SH and ITQ are obtained from the implementations provided by their authors, while the results of LSH are from our own implementation.
IV-D Experiment Results and Analysis
IV-D1 Experiment results on CIFAR10 dataset
Table II shows the MAP scores with different lengths of hash codes on the CIFAR10 dataset. Overall, the proposed QaDWH achieves the highest average MAP of 0.880, and consistently outperforms state-of-the-art methods on all hash code lengths. More specifically, compared with the best-performing deep hashing method DRSCH, which achieves an average MAP of 0.843, the proposed QaDWH has an absolute improvement of 0.037. Compared with the best-performing traditional method using deep features, SDH-VGG19, which achieves an average MAP of 0.600, the proposed method has an absolute improvement of 0.280. The best-performing traditional method using hand-crafted features, SDH, achieves an average MAP of 0.322, over which the proposed approach has an improvement of 0.558. Compared with the traditional weighted hashing method QRank, which achieves an average MAP of 0.822, the proposed QaDWH has an absolute improvement of 0.058. This is because QaDWH benefits from the joint training of hash codes and corresponding class-wise weights, while QRank can only learn the weights but cannot give feedback for learning better hash codes.
Figure 3(a) shows the precision within Hamming radius 2 using hash lookup. The precision of the proposed QaDWH consistently outperforms state-of-the-art methods on all hash code lengths, because QaDWH benefits from the joint training scheme and can generate better hash codes. The precision within the top 500 retrieved samples is shown in Figure 3(b), where the proposed QaDWH still achieves the highest precision, which demonstrates the effectiveness of fine-grained ranking. Figure 3(c) shows the precision curves for different numbers of retrieved samples with 48-bit hash codes, and the proposed QaDWH achieves the highest accuracy. Figure 3(d) shows the precision-recall curves using Hamming ranking with 48-bit codes. QaDWH still achieves the best accuracy at all recall levels, which further shows the effectiveness of the proposed approach.
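For illustration, the two precision metrics used above (precision within Hamming radius 2 and precision within the top retrieved samples) can be sketched as below. This is a simplified sketch and not our evaluation code: it assumes hash codes packed into Python integers, single-label relevance (an image is relevant if it has the same label as the query), and a precision of 0 when nothing falls inside the radius. All function names are hypothetical.

```python
def hamming(a, b):
    """Plain Hamming distance between two integer-packed hash codes."""
    return bin(a ^ b).count("1")


def precision_within_radius(query_code, query_label, db_codes, db_labels, radius=2):
    """Fraction of relevant items among database codes within `radius` bits."""
    retrieved = [label for code, label in zip(db_codes, db_labels)
                 if hamming(query_code, code) <= radius]
    if not retrieved:
        return 0.0  # assumed convention for an empty lookup result
    return sum(label == query_label for label in retrieved) / len(retrieved)


def precision_at_k(query_code, query_label, db_codes, db_labels, k):
    """Fraction of relevant items among the k nearest codes by Hamming ranking."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    top = order[:k]
    if not top:
        return 0.0
    return sum(db_labels[i] == query_label for i in top) / len(top)


if __name__ == "__main__":
    codes = [0b0001, 0b0011, 0b0111, 0b1111]
    labels = ["cat", "dog", "cat", "cat"]
    print(precision_within_radius(0b0000, "cat", codes, labels))
    print(precision_at_k(0b0000, "cat", codes, labels, k=3))
```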
IV-D2 Experiment results on NUS-WIDE dataset
Table III shows the MAP scores with different lengths of hash codes on the NUS-WIDE dataset. Following [2, 27], we calculate the MAP scores based on the top 5000 returned images. Similar results can be observed on NUS-WIDE: the proposed QaDWH still achieves the best MAP scores (average 0.878). QaDWH achieves an absolute improvement of 0.053 on average MAP compared with the best-performing deep hashing method DRSCH (average 0.825). Compared with the best-performing traditional method using deep features, SDH-VGG19, which achieves an average MAP of 0.794, QaDWH has an absolute improvement of 0.084. It is also interesting to observe that with the deep features extracted from the VGG-19 network, the traditional method SDH achieves results comparable to those of deep hashing methods. Compared with QRank (average 0.829), the proposed QaDWH still achieves an absolute improvement of 0.049, which shows that the proposed QaDWH method benefits from jointly training hash codes and corresponding weights.
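As a minimal sketch of how MAP over the top returned images can be computed, the code below evaluates average precision on a truncated ranking list. It assumes binary relevance flags that are already sorted by the retrieval ranking, and it normalizes by the number of relevant items found within the top k, which is one common convention and not necessarily the exact protocol of [2, 27]. The function names are hypothetical.

```python
def average_precision_at_k(ranked_relevance, k):
    """Average precision over the top-k entries of a ranked 0/1 relevance list."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance[:k], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_relevance, k):
    """Mean of the per-query average precisions at cutoff k."""
    aps = [average_precision_at_k(r, k) for r in all_ranked_relevance]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    queries = [[1, 0, 1, 0], [0, 0, 0, 0]]
    print(mean_average_precision(queries, k=4))
```

With k set to 5000 this corresponds to the cutoff used on NUS-WIDE under the stated assumptions.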
Figure 4 (a), (b), (c) and (d) show the retrieval accuracy on NUS-WIDE. Similarly, the proposed QaDWH achieves the best accuracy on the four evaluation metrics, due to the joint training scheme and the fine-grained ranking for different queries.
IV-D3 Experiment results on MIRFLICKR dataset
The MAP scores with different lengths of hash codes on the MIRFLICKR dataset are shown in Table IV. The proposed QaDWH method achieves an average MAP score of 0.800, which outperforms the other deep hashing methods DRSCH (0.783), NINH (0.766) and CNNH (0.755). Compared with the best-performing traditional method using deep features, SDH-VGG19, which achieves an average MAP of 0.739, QaDWH has an absolute improvement of 0.061. On MIRFLICKR, the proposed QaDWH method still outperforms the traditional weighted hashing method QRank by 0.029, which shows the effectiveness of jointly training hash codes and corresponding weights. Figure 5(a) shows the precision within Hamming radius 2 using hash lookup, from which we can observe that the proposed QaDWH approach achieves the best result. Figure 5(b) shows the precision curves within the top 500 retrieved samples, and QaDWH achieves the highest precision due to the fine-grained retrieval. Figure 5(c) and (d) show the top 1k results and the precision-recall curves with 48-bit hash codes. The proposed QaDWH method still achieves the best results, which further shows the effectiveness of query-adaptive fine-grained ranking.
IV-D4 Experiment results on ImageNet dataset
The MAP scores with different lengths of hash codes on the ImageNet dataset are shown in Table V. Note that for this large-scale dataset, we only report the results of traditional methods using deep features. We also calculate the MAP scores based on the top 500 returned images, due to the high computation cost of MAP evaluation. From Table V we can observe that the proposed QaDWH approach achieves the best average MAP score of 0.211 on this challenging dataset. Compared with the traditional weighted hashing method QRank, our proposed QaDWH achieves an absolute improvement of 0.048, and QRank cannot achieve stable improvements over NINH on this large dataset. Compared with NINH, the best-performing deep hashing method on the ImageNet dataset, the proposed QaDWH achieves an absolute improvement of 0.043. On this large dataset, we can also observe that traditional methods such as SDH and ITQ achieve results comparable to those of deep hashing methods. Figure 6 (a), (b), (c) and (d) show the retrieval accuracy on ImageNet. Similarly, the proposed QaDWH achieves the best accuracy on these four evaluation metrics, due to the joint training of hash functions and corresponding weights and the fine-grained ranking for different query images.
IV-D5 Comparison of Testing Time
Besides comparing the retrieval accuracy of different methods, we also compare the testing time of the proposed approach and state-of-the-art methods. All experiments are conducted on the same PC with an NVIDIA Titan Black GPU, an Intel Core i7-5930k 3.50GHz CPU and 64 GB memory. The typical retrieval process of hashing methods consists of three stages: feature extraction, hash code generation and image retrieval from the database. We record the time cost of each stage for the different methods, and the final testing time is the sum of the three stages. Note that the proposed QaDWH approach and the other deep hashing methods are end-to-end frameworks, whose inputs are raw images and whose outputs are hash codes, while the compared traditional hashing methods take image features as input. Thus, for a fair comparison, we use deep features for the traditional methods. The compared query-adaptive hashing method QRank uses image features and hash codes generated by other methods as input, so its additional computation is the calculation of query-adaptive weights. The average testing time of the different methods is shown in Table VI, where each hashing method is run 5 times to calculate the average. Comparing the proposed QaDWH approach with the other deep hashing methods, we can observe that QaDWH is a little slower but still comparable (less than 1 millisecond for the small-scale datasets, less than 10 milliseconds for the large-scale ImageNet dataset), which is expected since QaDWH uses the relatively slower weighted Hamming distance. However, the proposed QaDWH achieves much better retrieval accuracy at a small additional time cost. Comparing the proposed QaDWH with the traditional query-adaptive method QRank, we can observe that QaDWH is much faster. This is because QRank consumes much more time to calculate the query-adaptive hash weights, while the proposed QaDWH approach needs only a simple matrix multiplication to calculate them.
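To make the cost argument concrete, the query-adaptive weighting step can be sketched as a single vector-matrix product followed by a weighted Hamming ranking. The sketch below is an illustration only: it assumes that the query weights are the probability-weighted sum of the class-wise weights and that the weighted Hamming distance sums the weights of the differing bits. The names `query_adaptive_weights`, `weighted_hamming` and `rank_database` are hypothetical, and the toy numbers are not taken from our experiments.

```python
def query_adaptive_weights(class_probs, class_weights):
    """Fuse predicted class probabilities (length C) with class-wise
    bit weights (C x B) into one weight per hash bit (length B)."""
    num_bits = len(class_weights[0])
    return [sum(p * row[j] for p, row in zip(class_probs, class_weights))
            for j in range(num_bits)]


def weighted_hamming(query_bits, db_bits, weights):
    """Sum of the weights of the bit positions where the two codes differ."""
    return sum(w for q, d, w in zip(query_bits, db_bits, weights) if q != d)


def rank_database(query_bits, db_codes, weights):
    """Indices of database codes sorted by ascending weighted Hamming distance."""
    return sorted(range(len(db_codes)),
                  key=lambda i: weighted_hamming(query_bits, db_codes[i], weights))


if __name__ == "__main__":
    probs = [0.9, 0.1]                 # toy predicted probabilities, 2 classes
    class_w = [[1.0, 0.2, 0.5],        # toy class-wise weights, 3 bits
               [0.1, 1.0, 0.5]]
    w = query_adaptive_weights(probs, class_w)
    database = [[0, 0, 1], [1, 1, 1]]  # both differ from the query in one bit
    print(w, rank_database([1, 0, 1], database, w))
```

In this toy example both database codes have the same plain Hamming distance to the query, but the weighted distance separates them, which is the fine-grained ranking effect discussed above.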
From the result table we can also observe that the deep hashing methods and the traditional hashing methods are comparable in terms of testing time, since hash code generation costs only a matrix multiplication, which is very fast, and all of them use the Hamming distance, which can be efficiently calculated by the bit-wise XOR operation.
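The bit-wise computation of the Hamming distance mentioned above can be illustrated with the short sketch below, which packs a binary code into an integer and counts the set bits of the XOR result. The helper names `pack_bits` and `hamming_distance` are hypothetical and only serve this illustration.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one integer, first bit most significant."""
    value = 0
    for b in bits:
        value = (value << 1) | (1 if b else 0)
    return value


def hamming_distance(code_a, code_b):
    """XOR marks the differing bit positions; counting them gives the distance."""
    return bin(code_a ^ code_b).count("1")


if __name__ == "__main__":
    a = pack_bits([1, 0, 1, 1])
    b = pack_bits([0, 0, 1, 0])
    print(hamming_distance(a, b))
```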
IV-E Baseline Experiments and Analysis
We also conduct two baseline experiments to further demonstrate the separate contributions of the proposed deep weighted hashing and the query-adaptive retrieval approach. (1) To verify the effectiveness of the query-adaptive retrieval approach, we perform experiments using fixed weights obtained by averaging the learned class-wise weights, so that each query has the same hash code weights. We denote the results of this baseline method as DWH. (2) To verify the effectiveness of deep weighted hashing, we conduct experiments without using hash weights at all, which is equivalent to the NINH method. We denote the results of NINH as NINH. The MAP scores of the baseline methods are shown in Table VII. From the result table, we can observe that on all four datasets, the DWH method outperforms NINH, which shows that the learned class-wise weights can reflect the semantic property of different image classes and thus improve the retrieval accuracy. QaDWH further outperforms DWH on all four datasets, which demonstrates that the query-adaptive image retrieval approach can further improve the retrieval accuracy. Figures 7 to 10 show the other four evaluation metrics on the CIFAR10, NUS-WIDE, MIRFLICKR and ImageNet datasets. From those figures we can clearly observe that DWH outperforms NINH and QaDWH outperforms DWH on those four evaluation metrics, which further demonstrates the effectiveness of the proposed deep weighted hashing and query-adaptive retrieval approach.
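The fixed weights of the DWH baseline can be sketched as a per-bit average over the class-wise weights. This is a minimal illustration assuming a plain arithmetic mean over classes; the function name `fixed_weights` and the toy values are hypothetical.

```python
def fixed_weights(class_weights):
    """Average the class-wise bit weights (C x B) over classes, giving one
    query-independent weight per hash bit (length B)."""
    num_classes = len(class_weights)
    return [sum(column) / num_classes for column in zip(*class_weights)]


if __name__ == "__main__":
    class_w = [[1.0, 0.2, 0.6],   # toy weights for class 0
               [0.0, 0.8, 0.4]]   # toy weights for class 1
    print(fixed_weights(class_w))
```

Because these weights do not depend on the query, every query is ranked with the same bit importance, which is what separates DWH from the query-adaptive QaDWH.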
In this paper, we have proposed a novel query-adaptive deep weighted hashing (QaDWH) approach. First, we design a new deep hashing network, which consists of two streams: the hash stream learns the compact hash codes and corresponding class-wise hash bit weights simultaneously, while the classification stream preserves the semantic information and improves hash performance. Second, we propose an effective yet efficient query-adaptive image retrieval approach, which first rapidly generates the query-adaptive hash weights based on the query’s predicted semantic probability and class-wise weights, and then performs effective image retrieval by weighted Hamming distance. Experiment results show the effectiveness of QaDWH compared with eight state-of-the-art methods on four widely used datasets. In future work, we intend to extend the deep weighted hashing scheme to a multi-table deep hashing framework, in which different weights are learned for different hash mapping functions.
-  A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in International Conference on Very Large Data Bases (VLDB), vol. 99, no. 6, 1999, pp. 518–529.
-  H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3270–3278.
-  J. Wang, S. Kumar, and S.-F. Chang, “Sequential projection learning for hashing with compact codes,” in International Conference on Machine Learning (ICML), 2010, pp. 1127–1134.
-  Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Annual Conference on Neural Information Processing Systems (NIPS), 2009, pp. 1753–1760.
-  Z. Chen, J. Lu, J. Feng, and J. Zhou, “Nonlinear discrete hashing,” IEEE Transactions on Multimedia (TMM), vol. 19, no. 1, pp. 123–135, 2017.
-  P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, “Spectral hashing with semantically consistent graph for image indexing,” IEEE Transactions on Multimedia (TMM), vol. 15, no. 1, pp. 141–152, 2013.
-  M. Kafai, K. Eshghi, and B. Bhanu, “Discrete cosine transform locality-sensitive hashes for face retrieval,” IEEE Transactions on Multimedia (TMM), vol. 16, no. 4, pp. 1090–1103, 2014.
-  Y. Zhang, L. Zhang, and Q. Tian, “A prior-free weighting scheme for binary code ranking,” IEEE Transactions on Multimedia (TMM), vol. 16, no. 4, pp. 1127–1139, 2014.
-  V. E. Liong, J. Lu, Y.-P. Tan, and J. Zhou, “Deep video hashing,” IEEE Transactions on Multimedia (TMM), 2016.
-  Y. Hao, T. Mu, R. Hong, M. Wang, N. An, and J. Y. Goulermas, “Stochastic multiview hashing for large-scale near-duplicate video retrieval,” IEEE Transactions on Multimedia (TMM), vol. 19, no. 1, pp. 1–14, 2017.
-  K. Ding, B. Fan, C. Huo, S. Xiang, and C. Pan, “Cross-modal hashing via rank-order preserving,” IEEE Transactions on Multimedia (TMM), vol. 19, no. 3, pp. 571–585, 2017.
-  D. Wang, P. Cui, M. Ou, and W. Zhu, “Learning compact hash codes for multimodal representations using orthogonal deep structure,” IEEE Transactions on Multimedia (TMM), vol. 17, no. 9, pp. 1404–1416, 2015.
-  M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in Annual Conference on Neural Information Processing Systems (NIPS), 2009, pp. 1509–1517.
-  Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe lsh: efficient indexing for high-dimensional similarity search,” in International conference on Very large data bases (VLDB), 2007, pp. 950–961.
-  W. Liu, J. Wang, S. Kumar, and S.-F. Chang, “Hashing with graphs,” in International Conference on Machine Learning (ICML), 2011, pp. 1–8.
-  Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 817–824.
-  L. Zhang, Y. Zhang, J. Tang, X. Gu, J. Li, and Q. Tian, “Topology preserving hashing for similarity search,” in ACM International Conference on Multimedia (ACM-MM), 2013, pp. 123–132.
-  G. Irie, Z. Li, X.-M. Wu, and S.-F. Chang, “Locally linear hashing for extracting non-linear manifolds,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2115–2122.
-  B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in Annual Conference on Neural Information Processing Systems (NIPS), 2009, pp. 1042–1050.
-  W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2074–2081.
-  J. Wang, J. Wang, N. Yu, and S. Li, “Order preserving hashing for approximate nearest neighbor search,” in ACM International Conference on Multimedia (ACM-MM), 2013, pp. 133–142.
-  Q. Wang, Z. Zhang, and L. Si, “Ranking preserving hashing for fast similarity search,” in International Joint Conference on Artificial Intelligence (IJCAI), 2015, pp. 3911–3917.
-  F. Shen, C. Shen, W. Liu, and H. Tao Shen, “Supervised discrete hashing,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 37–45.
-  A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision (IJCV), vol. 42, no. 3, pp. 145–175, 2001.
-  L. Fei-Fei and P. Perona, “A bayesian hierarchical model for learning natural scene categories,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 524–531.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Annual Conference on Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
-  R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” in AAAI Conference on Artificial Intelligence (AAAI), 2014, pp. 2156–2162.
-  M. Schultz and T. Joachims, “Learning a distance metric from relative comparisons,” in Advances in Neural Information Processing Systems (NIPS), 2003, pp. 41–48.
-  T. Yao, F. Long, T. Mei, and Y. Rui, “Deep semantic-preserving and ranking-based hashing for image retrieval,” in International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 3931–3937.
-  R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, “Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification,” IEEE Transactions on Image Processing (TIP), vol. 24, no. 12, pp. 4766–4779, 2015.
-  F. Zhao, Y. Huang, L. Wang, and T. Tan, “Deep semantic ranking based hashing for multi-label image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1556–1564.
-  W.-J. Li, S. Wang, and W.-C. Kang, “Feature learning based deep supervised hashing with pairwise labels,” in International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 1711–1717.
-  H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval,” in AAAI Conference on Artificial Intelligence (AAAI), 2016, pp. 2415–2421.
-  T. Ji, X. Liu, C. Deng, L. Huang, and B. Lang, “Query-adaptive hash code ranking for fast nearest neighbor search,” in ACM International Conference on Multimedia (ACM-MM), 2014, pp. 1005–1008.
-  Y.-G. Jiang, J. Wang, and S.-F. Chang, “Lost in binarization: query-adaptive ranking for similar image search with compact codes,” in ACM International Conference on Multimedia Retrieval (ICMR), 2011, pp. 16–22.
-  Y.-G. Jiang, J. Wang, X. Xue, and S.-F. Chang, “Query-adaptive image search with hash codes,” IEEE Transactions on Multimedia (TMM), vol. 15, no. 2, pp. 442–453, 2013.
-  X. Liu, C. Deng, B. Lang, D. Tao, and X. Li, “Query-adaptive reciprocal hash tables for nearest neighbor search,” IEEE Transactions on Image Processing (TIP), vol. 25, no. 2, pp. 907–919, 2016.
-  L. Zhang, Y. Zhang, J. Tang, K. Lu, and Q. Tian, “Binary code ranking with weighted hamming distance,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1586–1593.
-  X. Zhang, L. Zhang, and H.-Y. Shum, “QsRank: Query-sensitive hash code ranking for efficient ε-neighbor search,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2058–2065.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
-  M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2014.
-  T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from National University of Singapore,” in ACM International Conference on Image and Video Retrieval (CIVR), 2009, p. 48.
-  M. J. Huiskes and M. S. Lew, “The mir flickr retrieval evaluation,” in ACM International Conference on Multimedia Information Retrieval (MIR), 2008, pp. 39–43.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM International Conference on Multimedia (ACM-MM), 2014, pp. 675–678.